Introduction to Methods for Research in Official Statistics
Table of Contents
An Overview
Chapter 1. Research, Researchers and Readers
    1.1. Doing Research
    1.2. The Need for a Research Report
    1.3. Connecting with the Reader
    1.4. The Research Project
    1.5. Example of a Research Proposal
Chapter 2. Formulating the Research Problem and the Research Plan
    2.1. Research Problem
        2.1.1. From Topic to Questions
        2.1.2. Research Objectives
    2.2. Using Sources
    2.3. Stating the Research Hypotheses (and Assumptions)
    2.4. Putting together the Research Plan
        2.4.1. Formulating the Conceptual and Operational Framework
        2.4.2. Using a model
        2.4.3. Selecting methodologies
Chapter 3. Case Studies from First Research-based Regional Course
    3.1. Summaries of Research Proposals
    3.2. Executive Summaries of Selected Final Reports
        3.2.1. Statistics for Local Development Planning (Try Sothearith, Cambodia)
        3.2.2. Access and Quality of Basic Education in Myanmar (Khin Khin Moe, Myanmar)
        3.2.3. Determinants of Poverty in Nepal (Shib Nandan Prasad Shah, Nepal)
Chapter 4. Data Processing & Analyses
    4.1. Organizing and Summarizing Data
        4.1.1. Getting Started with STATA
        4.1.2. Obtaining Tables and Summary Measures with STATA
        4.1.3. Managing Data with STATA
        4.1.4. Obtaining Graphs and Charts with STATA
        4.1.5. Correlation Analysis with STATA
    4.2. Performing Statistical Inference
    4.3. Obtaining and Utilizing Statistical Models
        4.3.1. Regression Analysis with STATA
        4.3.2. Logistic Regression and Related Tools with STATA
        4.3.3. Factor Analysis with STATA
        4.3.4. Time Series Analysis with STATA
Chapter 5. Report Writing
    5.1. Preparations for Drafting
    5.2. Creation of the Draft
    5.3. Diagnosing and Revising the Draft
    5.4. Writing the Introduction, Executive Summary and Conclusion
        5.4.1. The Introduction and Conclusion
        5.4.2. The Executive Summary
Chapter 6. Reporting Research Results
    6.1. Disseminating Research Results
    6.2. Preparing for the Presentation
    6.3. The Presentation Proper
Appendix. Some Grammar Concepts
An Overview
This training manual presents an introductory course on methods of research in official statistics, and is based on the vast literature available on how to conduct research effectively. The manual is meant as learning material for the Research-based Training Program (RbTP) Regional Course of the United Nations Statistical Institute for Asia and the Pacific (UNSIAP). The objective of the RbTP is to train government statisticians of developing countries in Asia and the Pacific to enhance their capability in undertaking independent research in official statistics and preparing quality statistical reports.

The RbTP responds to a training gap in most statistical systems of the developing countries of Asia and the Pacific. Official statisticians, especially in National Statistical Offices (NSOs), are typically not much involved in in-depth analysis of survey and census results or of administrative-based data. The official statistician's role in analysis for guiding policy is typically that of a passive data provider rather than active involvement in an interdisciplinary approach to analysis. While this situation may have been acceptable through the years, data analysis for the purpose of improving data quality and statistical processes clearly needs to be undertaken by NSOs. Statistical offices also need capability in doing research on methodologies to improve statistical processes: addressing problems in survey design, index computations, developing measurement frameworks for new areas in statistics, assisting policy makers and program formulators in the use of statistical models (say, for forecasting purposes), and the like.

The training manual covers basic research principles and a number of statistical methods for research. The topics included in this manual are: Issues on Doing Research; Formulating the Research Problem and Research Design; Statistical Analyses; and Writing and Presenting the Research Paper. Some examples of executive summaries, especially from the First RbTP Regional Course conducted by UNSIAP and its local partner, the Philippine Statistical Research and Training Center, are also provided. The training manual also provides extensive discussions of the use of the statistical software STATA (release 10) for basic data
processing and analyses, including generating tables and figures, running and diagnosing regression models, and performing factor analyses.
This manual has been prepared under the general direction of Davaasuren Chultemjamts, Director of UNSIAP, with the assistance of Jose Ramon G. Albert. We hope that the course materials presented here are useful for self-learning. Questions, comments and suggestions are most welcome, and should be directed to UNSIAP Director Davaasuren at [email protected]. The thrust of UNSIAP is to foster improvements in statistical capacity across the Asia and Pacific region, and it is hoped that this manual represents a useful step in this direction.
Beyond answering these questions, research must also seek explanations to "why" questions, such as: What are the factors related to poverty dynamics? Why are the levels of fertility in certain areas higher than in others? What underlying factors do workers in a farm consider when evaluating their work environment?

Scientific knowledge has nothing to do with right or wrong. Much of what we understand is transient, i.e. never fully completed, but rather a set of evolving paradigms (Kuhn, 1980). Scientific endeavors are consensual; they are the result of shared experiences, part of human endeavors to become better than who we are. They are done in the context of openness, of being open to scrutiny.

There are various typologies of research. We may, for instance, classify research into basic and applied sorts: basic (also called pure, theoretical or analytic) research, which deals with laws, axioms, postulates and definitions in the field of inquiry; and applied (or empirical) research, which involves measurements or observations.
When the research has no apparent application to a practical problem, but is of conceptual and scholarly interest to the academic/research community, we refer to it as pure research. Basic research is concerned with organizing data into the most general yet parsimonious laws. The emphasis of the research is on comprehension or understanding, so the research contains valid, complete, and coherent descriptions and explanations. Applied research, on the other hand, is concerned with the discovery of solutions to practical problems; it places emphasis on factual data which have more immediate utility or application. In basic research, the rationale defines what we want to know about the phenomenon, while in applied research, the rationale defines what we want to do.

In carrying out a research inquiry, we can work with qualitative and/or quantitative investigations, depending on the nature of the data. Qualitative research is an inquiry based on information derived from understanding the behavior of people and institutions, their values, rituals, symbols, beliefs and emotions. This requires a researcher who is, say, studying poverty to immerse him/herself in the lives of the poor through complete participant observation, case studies, focus group discussions, and the like. Here, the data gathered may be in the form of the words of the persons studied. Quantitative research is based on information derived from surveys, experiments, and administrative records. The raw data from these sources require analytical methods for systematically evaluating information, including the use of statistical science, which involves summarizing data into meaningful information or drawing conclusions from sample data about the objects that we study. Here the data are in the form of numbers.
A research inquiry may also involve a combination of qualitative and quantitative approaches. This training manual focuses on quantitative research on official statistics.
Research may also be classified depending on purpose. Research may be exploratory; this is undertaken when tackling a new problem/issue/topic about which little is known. At the beginning of an exploratory research undertaking, the research idea cannot be formulated very well. The problem may come from any part of the discipline; it may be a theoretical research puzzle or have an empirical basis. The research work will need to examine what theories and concepts are appropriate, to develop new ones, if necessary, and to assess whether existing methodologies can be used. It obviously involves pushing out the frontiers of knowledge in the hope that something useful will be discovered.
Research may also be of the testing-out type, in which the limits of previously proposed generalizations are tested, e.g.: Does the theory apply in new technology industries? With working-class parents? Before globalization was introduced? In the wake of a global financial crisis? The amount of testing out to be done is endless and continuous. This is the way to improve a particular discipline. Research may also be of the problem-solving type, where we start from a particular problem in the real world, and bring together all the intellectual resources that can be brought to bear on its solution. The problem has to be defined and the method of solution has to be discovered. The person working in this way may have to create and identify original problem solutions every step of the way. This will usually involve a variety of theories and methods, often ranging across more than one discipline, since real-world problems are likely to be messy and not solvable within the narrow confines of an academic discipline (Phillips and Pugh, 2000).

Although research may be variegated in purpose, nature of work, and general type, we can identify the characteristics of good research:
- Research is based on an open system of thought. That is, one can ask any question, and even challenge the established results of good research.
- Researchers examine data critically. Are the facts correct? Can we get better data? Can the results be interpreted differently?
- Researchers generalize and specify the limits on their generalizations. Valid generalizations are needed for a wide variety of appropriate situations.
- Research involves gathering information for solving a problem; it should go beyond description and should seek to explain, to discover relationships, to make comparisons, to predict, to generalize, and to dare to question existing theories and beliefs.

As research seeks explanations, the researcher needs to identify causal relationships among the variables being investigated. The logical criteria for causal relationships involve: (a) an operator existing which links cause to effect; (b) the cause always preceding the effect in time; (c) the cause always implying the effect. However, sometimes, the
nature of the design may not allow for a direct establishment of cause and effect relationships. At best, we may only be able to determine correlations that indicate association.
As we go about doing the research, we will have to employ deductive reasoning, inductive reasoning, or a combination of the two. In deduction, laws are stated in universal terms, and reasoning goes from the general to the particular. Induction involves a systematic observation of specific empirical phenomena, with the laws stated in terms of probability. Here, reasoning goes from the particular to the general.
These inquiries are addressed in a self-correcting cyclical process entailing deduction, where we reason toward observations, and induction, where we reason from observations (cf. Figure 1-1). Thus, research is a cyclical process rather than a linear event with a beginning and an end.
Figure 1-1. The Research Process.

The research process is not actually rigid: it involves the identification and formulation of research hypotheses, the development of a research design, the collection, analysis and interpretation of data, as well as drawing conclusions from the data analysis and reporting the research findings. Research is based on the testing of ideas that may start with a theory about some phenomenon. That theory might rely on the literature in the discipline. Or, more simply, the researcher might be curious about some particular behavior occurring in a particular observation of the phenomenon under investigation, and wonder why this is so. The researcher thinks about the problem, looks at relevant studies and theories in the literature, and tests the explanation by conducting the research. The researcher focuses the question(s) of interest into specific testable hypotheses. A research design is made to ensure that data are gathered in a systematic and unbiased manner to answer clearly the research hypotheses that have been posed. After designing the research, data are collected and subsequently analyzed. If
anything unexpected or unanticipated results from the data analysis, alternative explanations for the findings are proposed, which may lead to further data generation and analysis.
One research problem can spark further examination into the phenomenon under investigation, and as more and more data and analyses accumulate, we gain a deeper understanding of the phenomenon. As knowledge deepens, so do the questions and ambiguities about the phenomenon. Within a single research project, of course, there is only a rough linear sequence of events, because time constraints lead us to delineate some clear beginning and end to the project; but the project may set off further research. And so, the conduct of research can be quite challenging.
Research is hard work, but like any challenging job well done, both the process and the results bring immense personal satisfaction. But research and its reporting are also social acts that require you to think steadily about how your work relates to your readers, about the responsibility you have not just toward your subject and yourself, but toward them as well, especially when you believe that you have something to say that is important enough to cause readers to change their lives by changing how and what they think (Booth et al., 1995).

Research in official statistics takes a very special track. Within the context of the statistical production process, the research needs of official statistics can be viewed in terms of process quality. There may be needs for resolving data quality issues with the aid of imputation, variance estimation, validation and automatic editing. Research concerns also go beyond the production side toward concerns from the demand side. For instance, NSOs now have to address data demands for small area statistics, especially on matters of public policy such as poverty, in the light of monitoring progress in meeting the Millennium Development Goals (MDGs). A number of official statistics are direct survey estimates that can only be released for rather large areas, yet users may require small area statistics for local development planning purposes and, particularly, for localizing the monitoring of the MDGs at the district or sub-district levels. NSOs are trying to address these user needs by experimenting with statistical methods and models that allow survey data to borrow information over space and time from other data sources, such as censuses and administrative reporting systems. Toward improving the timeliness of statistics, NSOs are also investigating the possibility of generating flash estimates, especially to enable policy planners to have rapid assessments of trends relative
to, say, food security and hunger. In the area of data compilation, there are a growing number of research themes pertaining to the harmonization of concepts and the improvement of the relevance of classifications arising out of changes in the macro-economy. The development of quality management frameworks in the statistical production process is also getting some attention as a way of enhancing public trust in official statistics. Traditionally, data analysis and statistical modeling were not viewed as part of the functions of statistical offices, but there is a growing recognition of the need to use many data analytic tools and methods in the statistical production process, whether for data validation purposes, for time-series analysis, or for statistical visualization (which enables better dissemination of information). A number of statistical models do not directly answer research questions, but may be used as intermediate tools for research, say, to reduce a dataset into fewer variables (including principal components analysis and factor analysis), to group observations (including cluster analysis and discriminant analysis), or to perform correlation analysis between groups of variables (including canonical correlation analysis); the results of these methods may then be used subsequently for other analyses. More and more, statistical research is considered an
essential and integral part of the entire production process of official statistics. The research issues enumerated are certainly not exhaustive of the issues that ought to be addressed, but they serve as examples of some challenges in improving the statistical production process, to enable official statistics to be timely, reliable, and meaningful. Of course, the common person may also distrust statistics (given the many misuses and abuses of statistics). It may also be the case that official statisticians are thought of as those who diligently collect irrelevant facts and figures and use them to manipulate society. No less than Mark Twain stated that "There are three kinds of lies: lies, damned lies, and statistics" (see Figure 1-2). A book was even written by Darrell Huff (with illustrations by Irving Geis) in 1954 on How to Lie with Statistics. Despite the view that statistics may be twisted, there is also a sense of the overwhelming importance of statistics, both the figures and the science itself. Florence Nightingale is quoted to have pointed out that Statistical Science is "...the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social
administration, all education, all organization based on experience, for it only gives results of our experience."
Figure 1-2. Cartoon on how statistics are distrusted.

Every day, decisions have to be made, sometimes routinely. In The Histories of Herodotus, the Persians' method of decision-making is described as follows: if an important decision is to be made, [the Persians] discuss the question when they are drunk, and the following day the master of the house submits their decision for reconsideration when they are sober. If they still approve it, it is adopted; if not, it is abandoned. Conversely, any decision they make when they are sober is reconsidered afterwards when they are drunk. Such a decision-making process may be considered rather strange by today's standards, especially as we tend more and more to formulate evidence-based decisions. In decision-making for policy and program formulation and implementation, it is crucial to have official statistics and other information to base decisions upon. Without timely, accurate, reliable, credible official statistics, public policy and the entire development process are blind. That is, without official statistics, policy makers cannot learn from their mistakes and the public cannot hold them and the institutions accountable.
While some people have very good memories, most of us may find it hard to remember our findings, or may misremember our results, unless we write up our research. This is why many
researchers find it useful to start writing the research report right from the beginning of the research project and process. Writing up the research work enables the researcher to see more clearly the relationships, connections and contrasts among various topics and ideas. Writing thus enables a researcher to organize thoughts more coherently, and to reflect on the ideas to be presented in the report. Also, the better we write, the better we can read what others have written. Thus, writing enables a researcher to improve himself/herself in the communication process.
Even though it is important to write up research findings, must they be written into a formal paper, especially when this task may be quite demanding? Such concerns are reasonable. However, writing research reports into formal research papers enables the research findings to become part of the body of knowledge that people can access, especially as a public good. This, in turn, enables others to learn from our research results, including the mistakes we have made. Formal research reports may be published, and consequently disseminated to a wider readership. They may also be disseminated in other formats, e.g. as unpublished manuscripts available through the internet.
Writing research into a formal report also helps a researcher arrange the research findings in ways that readers will readily understand, especially since it takes several drafts to come up with a final report. Researchers will need to
communicate research results with arguments, including claims, evidence, qualifications and warrants. Arguments may involve definitions, analogies and comparisons, cause-and-effect, contributions and impacts.
All statements in a research report will have to be structured and linked into a form that anticipates readers' views, positions and interests, as well as what readers will question: from evidence to argument. In short, thinking in a written form (especially a formal paper) allows researchers to be more cautious, more organized, and more familiar with and attuned to others' needs and views different from those of the researcher.
All readers bring to a research report their own predispositions, experiences, and concerns. So before a report is written, it is important for a researcher to think about the standpoint of his/her readers, and where the researcher stands as regards the question being answered. There is undoubtedly going to be variability among readers of the research report. Some readers may have no interest in the research problem so they may not be concerned with the research finding. Some may be open to the problem because the finding may help them understand their own problems. Others may have long-held beliefs that may interfere with the research findings. Some readers may expect the research paper to help them solve their own problems, or understand better the field of inquiry. Some readers may recognize the research problem being investigated, and some may not be aware of this concern. Some may take the research problem seriously, and some may need to be persuaded that the problem matters. Despite this variability, readers share one interest: they all want to read reports that can be understood.
Readers also want to know how a researcher thinks the research will change their ways of thinking and doing things. Are they expected to accept new information, change certain beliefs or take some action as a result of reading the research report? Will the research finding contradict what readers believe, and how? Will readers have some standard arguments against the proposed solution espoused in the research report? Will the solution stand alone or will readers want to see the details of the solution? And how is the research report to be disseminated? Will the report be published in a reputable journal? Will readers expect the report to be written in a particular format? Thus, it is important for a researcher to think seriously about the process of writing and communicating with a reader. The research report is analogous to explaining a path to someone who wishes to travel on a journey similar to the one the researcher has taken, with the researcher as a guide. Researchers certainly share something in common: problems in the writing process pertaining to the uncertainties, confusions, and complexities in the conduct of the research and in coming up with the final draft of the research report. As a researcher engages in the writing process, it is important to be aware of all these struggles, and to confront them head on rather than get overwhelmed.
1.4. The Research Project
A research project involves much planning and writing: from the choice of a problem, to the construction of questions or research objectives, to gathering data relevant to answering your question, to analyzing the data. Most of the writing in the research may involve simple note-taking that merely records what information has thus far been collected. In conducting a research project, a researcher typically has to go through the following steps to come up with a good research report:
1. Identification and Formulation of the Research Problem
   a. Evaluation of the Researchability of the Problem and the Value of the Research
   b. Construction of the Research Objectives, including Main Objective(s) and Specific Objectives
2. Review of Literature and Interview of Experts to Get Acquainted with the Problem
3. Review and Finalization of the Research Objectives
4. Construction of the Conceptual Framework of the Study (including Assumptions)
5. Formulation of the Hypotheses
6. Construction of the Operational Framework of the Study
7. Determination of Possible Research Obstacles
8. Construction of the Research Design
   a. Sampling Design
   b. Data Collection Method
   c. Methods for Data Analysis
9. Data Collection
10. Data Processing
11. Tabulation of Data
12. Analysis and Interpretation of Data
13. Drawing Conclusions and Making Recommendations
14. Reporting the Research Findings
Certainly, the steps above are not intended to be prescriptive, nor are they to be undertaken in a linear, sequential fashion. But it helps to see these steps, as well as to look through a few common faults when conducting research. Asis (2002) lists the latter:
Collecting data without a well-defined plan or purpose, hoping to make sense of it afterward;
Taking a batch of data that already exists and attempting to fit a meaningful research question to it;
Defining objectives in such general and ambiguous terms that your interpretations and conclusions will be arbitrary and invalid;
Undertaking a research project without reviewing the existing professional literature on the subject;
Use of ad hoc research techniques unique to a given situation, permitting no generalizations beyond the situation itself and making no contribution to the general body of research;
Failure to base research on a sound theoretical or conceptual framework, which would tie together the divergent masses of research into a systematic and comparative scheme, providing feedback and evaluation for educational theory;
Failure to make explicit and clear the underlying assumptions within your research so that it can be evaluated in terms of these foundations;
Failure to recognize the limitations in your approach, implied or explicit, that place restrictions on the conclusion and how they apply to other situations;
Failure to anticipate alternative rival hypotheses that would also account for a given set of findings and which challenge the interpretations and conclusions reached by the investigator.
Many of the issues above could probably be ironed out if, before the project commences, a research proposal were drafted to enable the researcher to plan out the course of investigation, i.e., what is to be done, and what and how to measure. The research implementation ought to be guided by the research proposal. Researchers typically also write research proposals to enable them to apply for funding, or to receive institutional review and ethics approval. When writing up a research proposal, it has been suggested (see, e.g., the Seminar in Research Methods at the University of Southern California as quoted by Asis, 2002) that the following checklist of issues be kept in mind:
1. Basic difficulty: What is it that has caught your interest or raised the problem in your mind?
2. Rationale and theoretical base: Can this be fitted into a conceptual framework that gives a structured point of view? In other words, can you begin from a position of logical concepts, relationships, and expectations based on current thinking in this area? Can you build a conceptual framework into which your ideas can be placed, giving definition, orientation, and direction to your thinking?
3. Statement of the purpose or problem: Define the problem. What is it that you plan to investigate? What is the context of the proposed research? What are the general goals of the study? Why is it important?
4. Questions to be answered: When the research is finished, what are the questions to which reasonable answers are expected?
5. Statement of hypotheses or objectives: Spell out the particular research hypotheses you will test or the specific objectives at which the research is aimed. Be concrete and clear, making sure that each hypothesis or objective is stated in terms of observable behavior, allowing objective evaluation of the results.
6. Design and procedure: State who your subjects will be, how they will be selected, the conditions under which the data will be collected, the treatment variables to be manipulated, what measuring instruments or other data-gathering techniques will be used, and how the data will be analyzed and interpreted.
7. Assumptions: What assumptions have you made about the nature of the behavior you are investigating, about the conditions under which the behavior occurs, about your methods and measurements, or about the relationship of this study to other persons and situations?
8. Limitations: What are the limitations surrounding your study and within which conclusions must be confined? What limitations exist in your methods or approach: sampling restrictions, uncontrolled variables, faulty instrumentation, and other compromises to internal and external validity?
9. Delimitations: How have you arbitrarily narrowed the scope of the study? Did you focus on selected aspects of the problem, certain areas of interest, a limited range of subjects, or the level of sophistication involved?
10. Definition of terms: Limit and define the principal terms you will use, particularly where terms have different meanings to different people. Emphasis should be placed on operational or behavioral definitions.
After the research proposal is crafted and the research commences, a research report is drafted, revised, and finalized. Thus, doing research involves a number of skills:
Practical skills, such as designing, planning and providing a scope to the research work, outlining reports, formatting bibliographies, setting priorities, writing clearly and presenting results;
Research skills, such as methods of finding relevant literature;
Conceptual skills, such as defining research objectives and hypotheses, organizing thoughts into coherent claims and arguments, making sound analyses and inferences from data, as well as understanding how science works as a collective enterprise.
In addition, it is helpful to know how the research report will be assessed by readers. Ultimately, the findings of a research and contents of a research report are the responsibility of the researcher.
1.5. Example of a Research Proposal

Introduction
The Thai government has promoted voluntary family planning since 1970. Economic and social development, together with technological advancements in the medical sciences, have produced longer life expectancy for the Thai population. Thus, the ratio of elderly people to the total population is increasing.
Objectives
1. To examine the demographic and socio-economic characteristics and living conditions of the elderly.
2. To analyze factors influencing the employment status of the elderly.
3. To provide basic information on the Thai elderly that might be used in planning elderly help projects in the future.
4. To provide information on the current situation of the elderly for policy makers.
Framework
(Conceptual framework diagram.) The study population is the elderly (60 years and over), classified by employment status into working and not working; the working elderly are further classified by industry into the agriculture and non-agriculture sectors. Covariates in the framework include sex, educational attainment, marital status, religion, total living children, living condition, health status, household characteristics (household headship status, owner of dwelling, housing quality index), family support, age, and geographic characteristics (region and area); the dependent variables are employment status and industry.
Operational Framework
(Diagram.) Independent variables: marital status, religion, household headship status, living condition, owner of dwelling, financial support, and region and area. Dependent variables: employment status and industry, measured as dummy and ordinal-scale variables.
Methodology
Descriptive statistics and cross-tabulations are used to examine the demographic and socio-economic characteristics and living conditions of the elderly. GLM univariate analysis and chi-square tests are used to analyze factors influencing the employment status of the elderly, and industry.

Source of Data
The 2002 Survey of the Elderly in Thailand; only the population aged 60 years and over is selected.
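As a rough illustration of this methodology, the STATA commands below sketch how summary measures and a cross-tabulation with a chi-square test might be obtained (STATA is the package used in Chapter 4 of this manual). The dataset and variable names (elderly2002.dta, empstatus, region, age) are hypothetical placeholders, not the actual survey file or its codebook.

    * Sketch only: file and variable names are hypothetical placeholders
    use elderly2002.dta, clear           // load the (hypothetical) survey extract
    keep if age >= 60                    // restrict to the elderly population (60 years and over)
    summarize age                        // summary measures for a numeric variable
    tabulate region empstatus, row chi2  // cross-tabulation with row percentages and a chi-square test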
Hypotheses
On the employment status of the elderly:
1. H1: ri ≠ rj
2. H0: f1 = f2 : Means of elderly employment status in different financial support categories are equal
   H1: f1 ≠ f2 : Means are unequal
3. H0: h1 = h2 = h3 = h4 = h5 : Means of elderly employment status across different health statuses are equal
   H1: hi ≠ hj for some i ≠ j
4. H0: e1 = e2 = e3 = e4 : Means of elderly employment status across different educational attainment categories are equal
   H1: ei ≠ ej for some i ≠ j : At least one of the means is unequal to the rest
5. H0: h1 = h2 : Means of elderly employment status between different household headship statuses are equal
   H1: h1 ≠ h2 : Means of elderly employment status between different household headship statuses are unequal

On the industry of the working elderly:
1. H1: ri ≠ rj for some i ≠ j : At least one of the means is unequal to the rest
2. H0: m1 = m2 = m3 : Means of industry across different marital statuses are equal
   H1: mi ≠ mj for some i ≠ j
3. H0: o1 = o2
   H1: o1 ≠ o2 : Means of industry between different types of owners of dwelling are unequal
4. H0: hi1 = hi2 = hi3 = hi4
   H1: hij ≠ hik for some j ≠ k : At least one of the means is unequal to the rest
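Since these hypotheses compare mean employment status (or industry) across categories of a single factor, they could in principle be examined with the GLM univariate analysis mentioned in the methodology. The STATA sketch below illustrates one way this might be done; the variable names (empstatus, finsupport, healthstatus, educattain) are hypothetical placeholders rather than the actual variables of the 2002 survey.

    * Sketch only: variable names are hypothetical placeholders
    ttest empstatus, by(finsupport)          // compare means across two financial support categories (hypothesis 2)
    oneway empstatus healthstatus, tabulate  // one-way ANOVA across health status groups (hypothesis 3)
    anova empstatus educattain               // GLM univariate analysis with educational attainment as the factor (hypothesis 4)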
Most research topics generate several questions. Questions are generally of two types: what or where questions, and why questions. The grammatical constraints of a direct question force a researcher to be precise and particular. In the short term, a researcher will find that the question gives the researcher (and readers of the report) a fairly clear idea of the type of research needed to answer it: quantitative or qualitative; experimental or empirical. Does the research require a survey? Would a desk study be appropriate, or is there a need for further data collection? A researcher must progress not only from topic to questions, but from questions to their rationale or significance. Once questions are formulated, a researcher has to ask and try to answer one further question: So what? That is, a researcher must be able to state the value of the research, not only to the researcher, but also to others. In doing so, Booth et al. (1995) suggest that a researcher go through the following three steps:
a. name the topic, i.e. state what he/she is studying: I am researching ______________
b. imply a question, i.e. state what he/she does not know about the topic: because I want to find out who/how/why ___________
c. motivate the question and the project, i.e. state why he/she wants to know about the topic: in order to understand how/why/what ___________.
Such steps may seem, at first glance, to be quite academic and not of any use in the real world, particularly in the world of official statistics. However, research problems in official statistics are actually structured exactly as they are in the academic world. No skill is more important than the ability to recognize a problem important to clients and stakeholders of official statistics, and the public at large, as well as to pose that problem to readers of the research report. Readers must be convinced that the research result is important to them and that the research has found a solution to the research problem. A researcher ought to define the research problem very carefully, and to have some idea of why he/she chose it (rather than another problem). Simply identifying a topic which interests a researcher will not help him/her to choose a suitable method of researching it, nor to schedule the research work and report writing. If we make the mistake of not distinguishing between the topic to be investigated and the research problem to be solved, we run the risk of lacking the focus provided by a search for a specific solution to a well-defined problem. A researcher may read rather indiscriminately if he/she does not know quite what he/she is looking for, and will probably also make far more notes than required. The researcher may keep gathering more and more data, and not know when to stop; if he/she does stop, he/she may have difficulties identifying what to include in a research report and what not to, and worse, decide to put everything into the report and lead readers of the research report to data exhaustion (Booth et al., 1995).

"To raise new questions and new possibilities, to regard old problems from a new angle requires creative imagination and marks real advance in science" (Einstein, A. and L. Infeld, The Evolution of Physics, Simon and Schuster, 1938). Thus, in identifying and conceptualizing a research problem, it is important to consider that the formulation of a problem is often far more essential than its solution, which may be merely a matter of mathematical or experimental skill. As pointed out by D.B. Van Dalen in Understanding Educational Research (as quoted by Asis, 2002), a researcher may want to ask himself/herself the following:
Is the research problem in line with my goal expectations and expectations of others?
Am I genuinely interested in this research problem but free from strong biases? Do I possess or can I acquire the necessary skills, abilities, and background knowledge to study this research problem?
Do I have access to the tools, equipment, laboratories and subjects necessary to conduct the investigation?
Do I have the time and money to complete it? Can I obtain adequate data? Does the research problem meet the scope, significance, and topical requirements of the institution or periodical to which I will submit my report?
Will the research study lead to the development of other investigations? Has my research problem been studied before? Am I duplicating somebody else's work? What can I learn from what others have already done on this research problem? What else can I contribute?
Regardless of the specific topic and the approach chosen for investigating the problem, it is a good idea to be as deductive as possible. Even entirely qualitative studies cannot be open-ended, because finances and time for undertaking a research project are limited. At some point, the research has to end. A researcher needs to have some idea of what he/she is looking for, in order to figure out how to go about looking for it and roughly how long it will take to find it. And, of course, so that the researcher will know when he/she has found it!
The research objectives present what the researcher hopes to accomplish, in clearly defined parts or phases. These objectives are typically in the form of questions.
It is suggested that the general and specific objectives should be specific, measurable, action-oriented, realistic and time-related (SMART), and where appropriate, a hypothesis must be stated for each objective. The research objectives ought to delimit the scope of the research work to a rather manageable task. Poorly formulated objectives lead to unnecessary data collection and undoubtedly make the research project difficult to handle. Final outputs and results should be itemized and linked to the objectives. The research objectives may be stated in the form of questions in order to help formulate research hypotheses and methods for meeting the objectives.
A research supervisor is not there to tell the researcher what to do, or what to think. Supervisors can provide ideas, keep the researcher clear of known dead ends or of inappropriate use of tools, suggest references and the like, but the researcher is responsible for the research undertaking and for the writing of the proposal and report. During a presentation of the findings, a researcher should never say something like "I did it this way because my research supervisor told me to do so." Instead, a more proper statement is: "I did it this way because my research supervisor suggested it; I then compared it with other methods and decided this is indeed best, because...."
There are various functions of a review of related literature. A major purpose of using sources is to establish the originality of the research, that is, that the proposed research has not been undertaken before. Almost always, something related has been done, so a review of the literature organizes various sources, discusses them, and points out their limitations. A literature review helps avoid unintentional duplication of previous research. Of course, there are also
studies that repeat what others have done (or what was done earlier by the researcher) with some slight change in the methodology, e.g., in how constructs were operationalized or how data were analyzed. In the absence of a research problem, a literature search may also help guide the selection of a topic. It also provides the source of the theoretical and conceptual frameworks of the planned research. A literature search provides ideas on research methodology and on how to interpret research findings. It may help in indicating how the research will refine, extend or transcend what is now known in the body of knowledge about the subject matter. For instance, Mercado (2002) illustrates that when performing an epidemiological research, a careful review of existing sources of information can help determine:
incidence and prevalence - how widespread is the problem? what is the distribution? how often does it occur?
geographic areas affected - are there particular areas where the problem occurs?
changes across population groups - are there special population groups affected by the problem?
probable reasons for the problem - what are the probable reasons? do experts agree on a particular one? what are the agreements and conflicting views?
unanswered questions - what are the unanswered questions? what problems need further research?
The epidemiological research problem identified must now be defined in terms of its occurrence, intensity, distribution and other measures for which data are already available. The aim is to determine all that is currently known about the problem and why it exists.
Sources of information may include people and publications; the latter include general reference works (such as general and specialized encyclopedias, statistical abstracts, facts on file), chapters of books, books, journal articles, technical reports, conference papers, and even electronic sources (both online from the internet and offline from CD ROMs).
Encyclopedias, books and textbooks are typical sources of the conceptual literature; abstracts of journals of various scientific disciplines, actual articles, research reports, theses and dissertations serve as sources of research literature. Sources of both conceptual and research literature include general indices, disciplinal indices, and bibliographies (Meliton, 2002). Sources of information are typically classified into:
primary sources: the raw materials of the research, i.e. first-hand documents and/or materials such as creative works, diaries and letters, or interviews conducted;
secondary sources: books, articles, and published reports in which other researchers report the results or analysis of their research based on primary sources or data;
tertiary sources: books and articles based on secondary sources.
In official statistics, the primary sources may be actual data from a survey, and secondary sources may be published reports, and compilations of data from primary sources. Booth et al. (1995) provide sound advice for using secondary and tertiary sources: Here are the first two principles for using sources: One good source is worth more than a score of mediocre ones, and one accurate summary of a good source is sometimes worth more than the source itself.
Sometimes a secondary or tertiary source may misread the primary source (perhaps inadvertently, and occasionally even deliberately). This is especially true of information on the internet, which has become a source of a lot of information, both good and unhelpful. Thus, it is important to also evaluate sources.
It is suggested to consider the reliability of sources in approximately the following order:
(a) peer-reviewed articles in international journals;
(b) book chapters in edited collections; textbooks by famous authors (i.e. with a track record of publications in peer-reviewed journals); edited proceedings from international conferences;
(c) technical reports and electronic documents with no printed equivalent from well-regarded institutions;
(d) peer-reviewed articles in national or regional journals; textbooks by lesser-known authors;
(e) unedited conference proceedings; edited proceedings from local conferences;
(f) technical reports and electronic documents with no printed equivalent from rather obscure institutions.
Taking full notes can provide some guidance toward evaluating the information and sources. It may be helpful to first make bibliographical notes of the reference materials, and put them on cards as in Figure 2-1:
Frankfort-Nachmias, Chava and David Nachmias, 1996. Research Methods in the Social Sciences, 5th edition, New York: St. Martin's Press.

Figure 2-1. Example of a reference material on an index card
Such notes should include the author, editor (if any), title (including subtitle), edition, volume, publisher, date and place of publication, and the page numbers for an article in a journal. It may also be helpful to record the library call number of the reference material (although this is no longer cited in the final research report); this is meant to help trace back steps in case there is a need to recheck the source. For internet sources, information on the website, discussion or news lists, etc. will be important.
Reference notes are typically (see, e.g., Meliton, 2002) of the following forms:
(Direct) Quotation - The exact words of an author are reproduced and enclosed in quotation marks. It is important to copy each statement verbatim and to indicate the exact page reference so that quotations can be properly referenced in the written report.
Paraphrase - The researcher restates the author's thoughts in his or her own words.
Summary - The researcher states in condensed form the contents of the article.
These notes state the author (and title), and a page number, on the upper left-hand portion of the card (see Figure 2-2 and Figure 2-3). On the upper right-hand are keywords that enable a researcher to sort the cards into different categories. The body of the card provides the notes.
Research Problem
A research problem is an intellectual stimulus calling for a response in the form of scientific inquiry. For example: Who rules America? What incentives promote energy conservation? How can inflation be reduced? Does social class influence voting behavior?

Figure 2-2. Illustration of a direct quote
Hypothesis
A hypothesis is a tentative answer to the research problem; it is tentative because it is verified only after the data come in. Hypotheses should be clear, specific, testable using data and existing methods, and free from biases.

Figure 2-3. Illustration of a synthesis/paraphrased entry

As the research evolves, the rate of gathering information may exceed the rate at which the information can be handled, even if the researcher is doing speed reading. A researcher may also experience information overload when the related
literature is overwhelming (see Figure 2-4). A researcher need not include every available
publication related to the research topic or hypothesis. However, it may be important to keep everything one finds in the literature. Even publications that look quite remote from a research project are likely to be useful in the review section of the research report, where readers may get a sense of the extent to which a researcher knows the topic. Even if a publication is not eventually cited in the research report, it may be helpful in a future research undertaking. For primary information, it is important to cite the primary source rather than an interpretation or summary, even if the primary source is less accessible. For instance, the original census or sample survey should be cited rather than an article on the results of that survey.
Sometimes both should be cited, since the summary may be more readily accessible to the public. It may be important to seek help in planning out the research, including the review of literature, especially from those who can provide both understanding and constructive criticism. A research supervisor could provide guidance on the length of the literature review section of the research proposal and the research paper. A researcher could use such advice to select sources, working backwards from the publications most closely related to the research questions/hypotheses to those not so closely connected. This selection of related literature involves judgments of importance and value, and so includes a large element of subjectivity. The order and prominence of individual sources vary depending on the researcher's perspective, i.e. according to the question or hypothesis he/she is going to work on. There is also a general preference for recently published sources, and for sources that have undergone refereeing. While there may be some apprehension about the extent of subjectivity employed in selecting related literature, this need not worry a novice in research. Since a literature review is meant to contextualize a research project (thereby demonstrating the researcher's ability to identify research problems), the selection of related literature may be guided by asking the following broad questions: (a) What exactly has this source contributed? (b) How does that
help in answering the research question? (c) Is the information provided by an expert? (d) Is the source valid, and are there a variety of other sources? Meliton (2002) even lists a few more specific guide questions for evaluating sources: What is the research all about, i.e., what is the problem? What are the independent and dependent variables, if any?
Who were the respondents? How many? How were they selected? What were they made to do?
What was the research design? What experimental treatment was used?
What data gathering instruments were used? Were they reliable and valid tools?
What were the major findings? Were they logical? How were the data interpreted? Was the interpretation appropriate and accurate?
The answers to the broad and specific questions above may be subsequently used when writing up the final form of the research proposal as well as in the drafts of the research report.
Ordinarily, a research report will contain a section on the literature review after the statement of the problem and its introductory components but before the theoretical framework. The review must be systematically organized, and similarities and differences among the works cited should be pointed out. Some or most parts of the cited research may also be included in the write-up of the literature review, depending on their relevance to the research. For conceptual literature, hypotheses, theories and opinions are usually discussed (Meliton, 2002). After all the readings and interviews done, a researcher will have to wonder whether there is now a better idea of what the result of the research will be.
current scientific paradigms when he remarked that "the only time science progresses is when old professors die." A hypothesis is a reasonable first explanation of the true state of nature, based on previous research or first principles, and the research is designed to challenge the hypothesis. A hypothesis must:
possess sufficient clarity to permit a decision relative to sample or experimental fact;
lend itself to testing by investigation;
be adequate to explain the phenomena under consideration;
allow reliable means of predicting unknown facts;
be as simple as possible; and
be free of any biases, especially those of the researcher.
Hypotheses are typically couched in terms of the variables that are going to be used in the research. A simple example would be:
Research Question: Are households with large family sizes more likely to be poor than those with small family sizes?
Research Hypothesis: (Yes,) Households with large family sizes are more likely to be poor than those with small family sizes.
Notice that this research hypothesis specifies a direction, in that it predicts that a large household will be poorer than a small household. This is not the only possible form of a research hypothesis, which can also specify a difference without saying which group will be better off than the other. In general, however, a hypothesis is considered better if the direction is specified. Also, note the deductive reasoning principle behind testing research hypotheses. If a theory is true, then controlled experiments can be devised and evidence found to support the theory. If instead data are gathered first and an attempt is then made to work out what happened through inductive reasoning, a large number of competing theories could explain the result. This is called post-hoc theorizing; in this case, there is no way of knowing for certain which theory is correct, and there may be no way of ruling out the competing explanations, with the result that we end up choosing the theory that best fits our existing biases. Inductive reasoning has a role in exploratory research in developing initial ideas and hypotheses, but in the end the hypotheses have to be tested before they can have scientific credibility. It is important for the researcher not merely to come up with research hypotheses, but also to make sure that the findings will either reject or fail to reject these hypotheses, toward identifying a solution to the research problem. Some schools of thought advocate hypothesis-free research, since just by stating a hypothesis we are constructing a context for the research and limiting its outcomes. This is often advocated in the social sciences, where researchers may immerse themselves in communities as mere spectators, suspending preconceptions to allow the theory to follow the observations. However, there is some concern about this philosophy of research, as no person can escape their life experiences, which form an implicit hypothesis of how things work. It is sounder to state hypotheses explicitly and then design the research to test these hypotheses. In addition, the total-ignorance philosophy can be rather wasteful, as it ignores what has been done in the literature.
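To make the deductive step concrete, the directional hypothesis above can be confronted with data through a formal test. The sketch below is only an illustration, not part of the manual's worked examples: the data file, the variable names (hh_size, poor) and the cut-off used to define a "large" household are all assumptions.

```python
# Illustrative only: a one-sided test of whether the poverty rate is higher
# among large households than among small ones. The file, column names and
# the cut-off of 6 members are hypothetical.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("households.csv")      # assumed columns: hh_size, poor (1 = poor, 0 = not poor)
df["large"] = df["hh_size"] >= 6        # assumed definition of a "large" household

count = [df.loc[df["large"], "poor"].sum(), df.loc[~df["large"], "poor"].sum()]
nobs = [int(df["large"].sum()), int((~df["large"]).sum())]

# alternative="larger" matches the directional form of the hypothesis
z, p = proportions_ztest(count, nobs, alternative="larger")
print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")
```

A logistic regression of the poverty indicator on household size would serve the same purpose while allowing other covariates to be controlled for.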
Research hypotheses, unlike assumptions, can be tested. Assumptions are often taken for granted and not even mentioned, especially established laws. However, there may be assumptions in a research project that are made for convenience, or that cannot be tested within the time, budget, or design constraints of the research. These assumptions need to be explicitly stated, and they certainly put a scope on the research and its results. The research procedures used should be described in sufficient detail to permit another researcher to replicate the research. The procedures must be well documented and the methodology/design must be transparent.
Concepts allow scientists to classify and generalize; they are components of theories and define a theory's content and attributes (e.g., the central concepts of a theory of governance are power and legitimacy). Concepts are foundations of communication: scientists understand each other when using certain established concepts. Some of these concepts may be variables, i.e., concepts or constructs that can vary or have more than one value. Some variables can be very concrete, such as gender and height, but others can be quite abstract, such as poverty and well-being, or even IQ. In contrast, concepts that do not vary are constants.
Example 2.4.1. Definition of Alienation (Nachmias and Nachmias, 1996, pp. 32-33)
Conceptual definition of alienation: a sense of the splitting asunder of what was once held together, the breaking of the seamless mold in which values, behavior and expectations were once cast into interlocking forms. This conceptualization suggests the following five conceptual definitions:
1. Powerlessness - the expectation of individuals that their behavior cannot bring about or influence the outcomes they desire;
2. Meaninglessness - the perception of individuals that they do not understand decisions made by others or events taking place around them;
3. Normlessness - the expectation that socially unacceptable behavior (e.g., cheating) is now required to achieve certain goals;
4. Isolation - the feeling of separateness that comes from rejecting socially approved values and goals; and
5. Self-estrangement - the denial of the image of the self as defined by the immediate group or society.
Theories can have various functions. Nachmias and Nachmias (1996) propose four levels of theory, viz:
Ad hoc classificatory systems - arbitrary categories that organize and summarize empirical data;
Taxonomies - systems of categories constructed to fit empirical observations. Taxonomies enable researchers to describe relationships among categories;
Conceptual frameworks - descriptive categories systematically placed in a structure of explicit, assumed propositions. The propositions included in the framework summarize and provide explanations and predictions for empirical observations; they are not established deductively, however. They may be shown in the form of a figure (see, e.g., Figure 2-4);
Theoretical systems - combinations of taxonomies and conceptual frameworks that relate descriptions, explanations and predictions systematically. The propositions of a theoretical system are interrelated in a way that permits some to be derived from the others.
Two main functions of a conceptual framework are to reveal otherwise implicit elements in the system being investigated and to explain phenomena. It is important to use clear terms (e.g., point form or a diagram) in presenting the theoretical/conceptual framework to explain the state of the system being investigated. For example, Figure 2-4 illustrates the conceptual framework of political systems. A researcher also needs to specify how the program of research is relevant to examining and extending that theory/framework, and show clearly in the research proposal how each part of the proposal addresses a specific feature of that theory/framework. There ought to be some indication about how the theory/framework
will change as a result of the research. Alternatively, what question/s about the theory/framework does the research work answer?
[Diagram: demands and support enter the political system as inputs; decisions and actions emerge as outputs.]
Figure 2-4. Conceptual Framework of Political Systems
It is all well and fine to hypothesize relationships and state them in the research's conceptual framework, but providing evidence for the research hypotheses will remain at a standstill until the constructs and concepts involved are transformed into concrete indicators or operational definitions that are specific, concrete, and observable. Such operational definitions usually involve numbers that reflect empirical or observable reality. For example, one operational definition of poverty involves monetary dimensions: a household is considered poor if its per capita income is less than some threshold (say $1 a day in purchasing power parity terms). Most often, operational definitions are also borrowed from the work of others. The most important thing to remember is that there is a unit of analysis: household, individual, societal, regional, or provincial/state, to name a few. Thus, aside from formulating a conceptual framework, it is important to also formulate an operational framework, a set of procedures to be followed in order to establish the existence of the phenomenon described by the concepts. The operational framework follows the conceptual framework, except that the concepts in the conceptual framework are now replaced by variables or operational measurements of the concepts. There are no hard and fast rules about how to operationalize constructs. In some cases, constructing the operational framework is rather simple and, consequently, there is no controversy or ambiguity in what is to be done. In coming up with an operational framework, concepts have to be specified into observable and measurable variables, definitions have to be operationalized, and these
definitions should work for the purposes of the research, depending on the resources available to a researcher or those which can be readily accessed. The operational definitions allow the concepts in the conceptual framework to be specified into variables or operational measurements of these concepts. Some operational definitions can easily produce data. It may, for instance, be fairly easy to get someone to fill out a questionnaire if there is no ambiguity in the questions. It may also be important to use operational definitions that are convincing to readers, especially those who will be evaluating the research. It may be important to seek help, to have someone criticize the operationalization of the framework and definitions. It is thus often a good idea to use definitions that were used by other researchers (especially noted ones) in order to invoke precedents in case someone criticizes the research. (Of course, the past work of others may also be prone to criticism!)
Example 2.4.1 (continued): Alienation
To measure powerlessness, we can ask questions in a questionnaire to determine if a person feels powerless, e.g., "If you made an effort to change a government regulation you believe is unjust, how likely is it that you will succeed?"
To measure meaninglessness, we can ask questions to determine if a person understands events around him or her, e.g., "Who crafted the regulation banning the use of two-stroke engine tricycles?"
To measure normlessness, we can ask questions to determine if a person believes that he has to resort to unacceptable behavior to get what he wants, e.g., whether the respondent agrees or disagrees with the statement "Looting is acceptable in times of famine."
To measure isolation, we can ask questions to determine if a person has a feeling of separateness that comes from rejecting socially approved values and goals, e.g., whether the respondent agrees or disagrees with the statement "My family will accept me unconditionally when I go home today after a year of living in a red light district."
To measure self-estrangement, we can ask questions to determine if a person denies the image of the self as defined by the immediate group or society, e.g., "What characteristics define a good daughter? Which of these characteristics do you have?"
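As a rough sketch of how a monetary operational definition of poverty, such as the $1-a-day rule mentioned earlier, translates into data work, the fragment below classifies households as poor or non-poor. The data file, the column names and the days-per-month conversion are assumptions made only for illustration.

```python
# Minimal sketch (hypothetical file and column names): operationalizing "poverty"
# as per capita income below a $1-a-day threshold in purchasing power parity terms.
import pandas as pd

df = pd.read_csv("households.csv")        # assumed columns: hh_income_month, hh_size
DAILY_LINE = 1.0                          # illustrative poverty line, $ per person per day (PPP)

df["pc_income_day"] = df["hh_income_month"] / df["hh_size"] / 30.4   # assumed monthly-to-daily conversion
df["poor"] = (df["pc_income_day"] < DAILY_LINE).astype(int)          # the operational variable: 1 = poor

print(df["poor"].mean())                  # share of sampled households classified as poor
```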
As the research concepts are operationalized, a researcher must be aware of levels of measurement, i.e., that there is varying precision by which a variable is measured. Following the classic typology of Stevens (1951), anything that can be measured falls into one of the four types; the higher the type, the more precision in measurement; and every level of measurement
contains all the properties of the previous level. The four levels of measurement, from lowest to highest, are:
nominal: describes variables that are categorical in nature. The characteristics of the data being collected fall into mutually exclusive and exhaustive categories. Examples of nominal variables include demographic characteristics such as sex, marital status, religion, language spoken at home and race.
ordinal: describes variables that are categorical, and also can be ordered or ranked in some order of importance. Most opinion and attitudinal scales in the social sciences, for instance, are ordinal.
interval: describes variables that have more or less equal intervals, or meaningful distances between their ranks. For example, if a researcher were to ask individuals if they were first, second, or third generation immigrants, the assumption here is that the number of years between each generation is the same. All rates are interval level measures.
ratio: describes variables that have equal intervals and a fixed zero (or reference) point. Weight and age have a non-arbitrary zero point and do not have negative values. Certainly, 50 kilos is twice as much as 25 kilos, and a 15 year old person is one-half as old as a 30 year old person. Rarely do we consider ratio level variables in social science since it is almost impossible to have zero attitudes on things, although qualifications "not at all", "often", and "twice as often" might qualify as ratio level measurement.
The level of measurement of a variable determines the kind of analysis that can legitimately be carried out on it: the more information the numbers carry, the wider the range of statistical methods that can be applied to the data at hand.
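The sketch below, using invented values and variable names, illustrates how the declared level of measurement constrains which summaries are meaningful; it is an illustration rather than a prescription from the manual.

```python
# Rough illustration (all values and names invented): the level of measurement
# determines which summaries make sense for each variable.
import pandas as pd

df = pd.DataFrame({
    "religion":  ["A", "B", "A", "C"],                    # nominal: counts and modes only
    "attitude":  pd.Categorical(["low", "high", "medium", "low"],
                                categories=["low", "medium", "high"],
                                ordered=True),             # ordinal: ordering is meaningful
    "year_born": [1970, 1985, 1990, 2001],                 # interval: differences are meaningful
    "weight_kg": [62.0, 80.5, 71.2, 55.0],                 # ratio: true zero, ratios meaningful
})

print(df["religion"].value_counts())       # frequencies for nominal data
print(df["attitude"].min())                # min/max and medians are valid for ordinal data
print(df["year_born"].diff())              # differences for interval data
print(df["weight_kg"].mean(),              # means and ratios for ratio data
      df["weight_kg"].max() / df["weight_kg"].min())
```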
oversimplify the problem and may not be well suited to application. The reward for simplifying the understanding of the phenomenon by ignoring what is irrelevant for present purposes is that the model is tractable, i.e., it can be used to make predictions. Models are not limited to mathematical models. Medical researchers typically use some aspect of the physiology of a mouse as a model for the physiology of humans. Of course, medical models of persons based on animals can be misleading; these models provide clues that must be tested in direct investigation with human subjects. In the social sciences, models may consist of symbols rather than physical representations: i.e., the characteristics of some empirical phenomenon, including its components and the relationships between the components, are represented as logical arrangements among concepts. The elements for a model can be drawn from related literature, personal experiences, consultations with experts, existing data sets, and pilot studies. Figure 2-5 illustrates a model for the policy implementation process. This illustrates the dynamics
between the policy, the implementing organization, the targeted policy beneficiaries and environmental factors, through specific transactions and the reactions to those transactions. A model, whether mathematical or nonmathematical, captures important features of the object or system that it represents, enough features to be useful for the research investigation.
There are two strategies that can be adopted: either the theory comes before the research (a deductive approach), or the research comes before the theory (an inductive approach).
When the theory comes before the research:
Construct an explicit theory or model.
Select a proposition derived from the theory or model for empirical investigation.
Design a research project to test the proposition.
If the proposition derived from the theory is rejected by the empirical data, make changes in the theory or the research project.
If the proposition is not rejected, select other propositions for testing or attempt to improve the theory.
If the research is undertaken before the theory is formulated:
Investigate a phenomenon and delineate its attributes.
Measure the attributes in a variety of situations.
Analyze the data to determine if there are systematic patterns of variation.
Once systematic patterns are discovered, construct a theory. The theory may be any of the types mentioned earlier although a theoretical system is preferred.
There is a lively controversy as to which strategy is better: theory before research, or research before theory. As shown in Figure 1-2, the research process is cyclical. This iteration between induction and deduction continues until the researcher is satisfied (and can satisfy readers of the research report) that the theory is complete within its assumptions.
Certain research projects may be developmental, and it is important here to show a level of creativity. That is, the research must create something really new, or at least a new synthesis; it must result in a design that is better than the currently existing alternatives, and the research report must both define and demonstrate this advantage. To this end, criteria have to be identified for assessing the design. An example of design research is Sothearith's proposal for a framework for local development planning in Cambodia during the first Research-based Regional Course (cf. Chapter 3). The research report had to establish that there is a demand for local development planning; review existing designs in other countries, such as the Philippines, and identify their shortcomings; and present the proposed development framework. Ideally, it should also have shown an actual test of the framework at the field level. In Chapter 4 of this manual, we look into some mechanical processes and tools for reporting and summarizing data that also enable us to discuss and draw conclusions.
Decentralization of power from the central government to local government is one of the major reforms for poverty alleviation and the development of democracy in Cambodia. Since this development is young, with commune councils formed only in 2001, most members of the commune council or local government are inexperienced, especially in collecting data and building up a manageable and reliable database for development planning. Therefore, a systemic data management approach in each commune for effective and efficient decision-making and development planning is hereby proposed. This research proposal aims to draw up a development framework to build the present capacity of the commune council. Specifically: 1. To analyze the present capacity of the elected commune council or local government in systemic statistical management for local development planning; 2. To identify how the National Institute of Statistics can play a more active role in supporting the commune council; and 3. To draw up policy recommendations and a framework to establish or strengthen the local database system. One of the critical changes in Cambodia's development is to improve the capacity of the commune council in data collection and management. There are three concerns that need to be addressed by the framework: 1) identify changes in the use of data collection instruments in order to address emerging issues in development planning; 2) identify ways of presenting data and analytical techniques for policy decision-making; and 3) provide an opportunity for the local government to discuss and exchange information on commune issues related to the collection, processing and utilization of data.
Risk Coping and Starvation in Rural China (Yu Xinhua, China)
Even though it is widely recognized that giving farmers more secure land rights may increase agricultural investment, such a policy might undermine the function of land as a social safety net and, as a consequence, might not sustain broad support. This research explores the role of land as a safety net that helps rural households smooth consumption in case of shocks. Likewise, it will explore the impact of different land tenure arrangements on investment and productive efficiency. The main objective of this research is therefore to examine how rural households are able to cope with idiosyncratic shocks to their income, using panel data from Southern China. Combined panel data from an annual household survey and cross-section data from a field survey will be used to construct an econometric model. This model can be used to evaluate the effect of different land policies.
Small Area Estimation for Kakheti Region of Georgia (Mamuka Nadareishvili, Georgia)
In Georgia, various microeconomic indicators at levels lower than the region have not yet been explored. The research proposes to obtain estimates of various indicators at the district level by using small area estimation techniques. The primary data for the research would be the data
from the Integrated Household Survey and the Population Census 2002 of Georgia. The data are stored in Access format.
A New Methodology in Calculating Life Expectancy at Birth and its Application in Constructing Life Table in Iran (Mohammad Reza Doost Mohammadi, Iran)
The concept of life expectancy has a direct relation to the construction of life tables. Life tables are very useful in studying the changes of a population according to the effect of mortality on human life. This research, however, shall find a way of computing life expectancy without using a life table, by introducing new variables into the calculation of life expectancy at birth. Review of literature is the main method in this research; meanwhile, statistical tools such as regression and other analytical devices shall also be employed.
Access and Quality of Basic Education in Myanmar (Khin Khin Moe, Myanmar)
The research assesses the various programmes of the government on basic education and draws recommendations for promoting education, e.g., free and compulsory basic education by 2010 and 2015. Education indicators, descriptive statistics, trend lines and related charts are the methods that will be used in the research. Data will be taken from administrative records of the Myanmar Education Research Bureau, surveys, informal sectors and previous research.
Determinants of Poverty in Nepal (Shib Nandan Prasad Shah, Nepal)
The Ten-Year Plan of the government seeks to achieve a remarkable and sustainable reduction in poverty of 8 percentage points, from 38% of the population at the beginning of the Plan period, and to further reduce it to 10% in about fifteen years' time. Government and other agencies need reliable information regarding the causes of and factors affecting being poor, the location of the poor, and their livelihood means. Information on these aspects will help government and other agencies in their efforts to reduce poverty through better policies and better-targeted interventions. Thus, the general objective of this research is to identify the determinants of poverty in Nepal. Specifically: 1) to determine the relationship of household (and household head) characteristics to poverty in Nepal; 2) to identify the causes of and factors affecting being poor in Nepal; 3) to determine the extent of these factors based on the probability of being poor in Nepal; and 4) to categorize these factors according to whether they increase or decrease the chances of being poor in Nepal. An appropriate poverty line based on the latest results of the Nepal Living Standard Survey (1995/96) will be used. The procedure would involve the determination of a nutrition-based anchor, a consumption basket and adjustments for cost-of-living differences. The factors are categorized under household and household head characteristics. Logit regression will be used to identify the determinants of poverty in the household consumption survey data.
The Measurement of Size of Non-Observed Economy in Mongolia - Using Some Indirect Methods (Demberel Ayush, Mongolia)
An estimate of the non-observed economy has unfortunately not yet been incorporated in Mongolian GDP. Thus, this research proposal will study the inclusion in GDP of the value added created by the non-observed economy, as recommended in the 1993 System of National Accounts. Specifically, the study intends to elaborate the concepts and definitions relating to the coverage of non-observed economy activities, and to obtain the information needed for experimental estimates using indirect methods to reflect them in GDP.
Indirect approaches use various macroeconomic indicators, such as the discrepancy between income and expenditure statistics, the discrepancy between the official and actual labor force, transaction amounts relative to GDP, and physical inputs such as electricity consumption.
The Mortality Situation of Districts in Central Province of Papua New Guinea (Roko Koloma, Papua New Guinea)
All the mortality indices in Papua New Guinea (PNG) have been estimated indirectly using population census and demographic and health survey data. This is done at the national, regional and provincial levels. Attempts to derive sub-provincial (district) estimates have not yet been made because of the smaller sample sizes, which catered only for national and regional estimates, and because the design of the questionnaires did not allow such estimates to be made. With the current law in PNG on the provincial and local administrative setup, the direction is now more toward the district level. The medium-term development plan also clearly states the need for district planning as the basis for development planning and for meeting these policy goals. Therefore, the research proposal intends to: 1) indirectly estimate selected demographic indices (IMR, CMR, life expectancy, CDR, ASDR) for the four districts of Central Province; 2) analyze these indices and compare their trends and patterns with current and past regional/provincial level demographic indices for Central Province; 3) identify the need for district level demographic indicators in planning and decision making based on this study; and 4) support through this study the establishment of provincial data systems to cater for data collection and analysis at the district level, whether through censuses, surveys or administrative records.
Basic Needs Poverty Line Analysis (Benjamin Sebastian Sila, Samoa)
With insufficient information/data on poverty in Samoa, an assessment of the quality of the information provided in the Household Income and Expenditure Surveys for 1997 and 2000 shall be undertaken to analyze poverty indicators.
Integrated Fisheries Survey Based on Master Sample Methods: An Alternative to Monitor Food Security in the Philippines (Reynaldo Q. Valllesteros, Jr., Philippines)
One of the emerging social concerns under the United Nations Millennium Development Goals is food security, which the Philippine government has prioritized. As such, the Bureau of Agricultural Statistics of the Department of Agriculture gathers information through agricultural and fisheries surveys that are conducted regularly to monitor the level of food sufficiency among Filipino farmers, fishermen, and the general public. However, the Bureau's appropriated annual budget cannot support the regular updating and maintenance of the sampling frame, especially for aquaculture, which requires a list of aquaculture farm operators by type of aquafarm. Thus, the need to design and construct a master sample frame (MSF) to reduce the costs of updating and maintenance for all fisheries surveys is of utmost concern. The general objective of the study is to conduct desk research towards the development of an integrated fisheries survey design based on a master sample, using the 2002 Census of Fisheries Evaluation Survey (CFES) databases. The specific objectives are: 1) design and construct a prototype master sample frame (PMSF) for fisheries surveys; 2) conduct correlation and regression analysis to determine the association of the different characteristics of the operators of municipal/commercial fishing and aquaculture households; 3) identify indicators/auxiliary variables needed in the development of a prototype Integrated Fisheries Survey (IFS) design with different modules (aquaculture, commercial fishing, marine municipal and inland municipal fishing); and 4) develop a prototype IFS design and recommend it for pilot testing in the forthcoming special projects of the Bureau.
Developing a Poverty Index Using the APIS, MBN and MDG (Aurora T. Reolalas, Philippines)
The current measurement of poverty indexes in the country, such as the poverty incidence, poverty and income gaps, and severity of poverty, makes use of the Family Income and
Expenditure Survey (FIES), which is conducted every three years. In between FIES years, these measures are not available; instead, the annual food and poverty thresholds are estimated using a raising factor derived from the latest FIES. Since the Annual Poverty Indicators Survey (APIS) is conducted in between FIES years, the indicators in the APIS could be used to estimate the poverty indexes in the country. Thus, the main objective of this study is to develop a poverty index using the APIS, MBN and MDG.
Thailand is largely an agricultural country, with most of the labor force in the agriculture sector (about 42 percent in 2002). Although the poverty problem and the number of poor people have shown an improving trend in Thailand, poverty still exists in the agricultural sector. The poor were predominantly in the agriculture sector, especially farming households and farm workers; these two groups had head count indices of 19.1 percent (around 3.02 million people) and 26.3 percent (1.13 million people), respectively. This study is conducted in order to help policy makers and concerned parties in the government picture the real situation of poor farmers and the surrounding factors influencing their poverty. Specifically, the purposes of the study are to: 1) describe the characteristics of the poor farmers; 2) analyze the poverty incidence, poverty gap, severity of poverty and inequality; 3) analyze the correlation of variables; and 4) analyze the factors that influence poor families. In this manner, those in the government can find solutions and provide appropriate and pertinent assistance/transfers to the target group.
The commune, as the lowest level in Cambodia's four-tier government structure, was formed after the February 2002 local election to assume the key role of implementor and executor of the national government's development program at the local level. It has since played a very important role in local development planning. As such, the commune represents the concretization of the Cambodian government's commitment to decentralization and overall socioeconomic development reform.
Good commune development planning, however, requires having a set of clear and accurate data that could present the problems, priority needs, interests and potentials at the
local level and which may be used not only by and for the commune but also by other local governments like districts, provinces and municipalities as well as by the national government. It also requires having a well-organized and well-trained commune council that would be able to build a good data management system at the local level, especially in terms of data collection, processing and analysis. Objective. It is in this context that this study proposes the formulation of a
development framework meant to build and strengthen the present capacity of the relatively young and inexperienced commune council system in Cambodia to enable it to handle and manage commune-based data and information for development planning.
The framework being proposed is largely modeled after the Philippines' experience in decentralization and the corresponding development of a community-based system of collecting, processing, analysing, formulating and monitoring data and statistical indicators for planning at the community and local levels. Unfortunately, however, in view of time and funding
limitations, this study simply presents the framework as developed and does not conduct an actual test of the framework at the field level.
Rationale. In order to put the need or call for the proposed development framework for the strengthening of the commune council system in the proper perspective, it is best to present a brief review/overview of Cambodia's present statistical system and the manner of collecting data and information, especially for and at the local level.
The National Institute of Statistics (NIS) under the Ministry of Planning (MOP) of Cambodia is mandated to be the focal point of overall statistical matters in Cambodia. It conducts and produces database information at the national level. At the same time, it is tasked to compile and consolidate statistics on various activities collected by the concerned statistics and planning units of decentralized offices and other ministries as well as the data sets established by different international organizations and nongovernment organizations (NGOs) for their specific development purposes. Because of methodological inconsistencies,
differences in definition and coverage, lack of coordination, and at times, poor inter-ministerial
cooperation, however, such consolidation of databases and indicators has not been standardized and widely disseminated. With the current decentralization thrust in Cambodia, where the commune takes center stage, commune-based data that are accurate, timely and relevant become very critical. And while a lot of information is already being collected at the commune level, such as the data sets collected and monitored by the various international organizations and NGOs mentioned earlier, these databases are used only by, and made available mostly to, the same organizations for their own needs and purposes. There is hardly any linkage between them and other relevant end users, in particular the commune councils and the communes themselves, as well as other government agencies involved in local planning. Meanwhile, the SEILA program, the Cambodian government's overall national program in the decentralization reform experiment, by its very nature and mandate also collects commune-related data and information. However, because its main concerns relate to the provision of public services and investments funded through the local development fund (one of SEILA's core financial transfer mechanisms to the communes), SEILA's key information needs focus more on data related to performance evaluation and monitoring, financial status, and project progress and capacity than on data about the basic needs, well-being and living standards of communes. Moreover, the data and information collected by the SEILA program are sent to the provincial level for entry into one of the databases of the SEILA program and are not kept, managed and maintained at the commune level. Whatever data are collected are provided only in report form to the communes, without the databases being turned over to them. In addition, because the SEILA program is a national program that encompasses all levels of government administration, not all of the information and indicators that it maintains are meaningful and useful at each level. For instance, reduction in the level of poverty is a national indicator that is unlikely to be measured at the commune level.
The proposed alternative development framework of statistics: description, phases and methodology. Given the above, this study proposes an alternative system of commune-based data and information called the Commune Database Information System (CDIS), whose use will
not be limited only to a few specific end users, whose information is not narrowed to monitoring and evaluation activities at the province, district or even commune level, and whose information and indicators are specific to the needs of the commune for local planning. The proposed CDIS is a bottom-up system, with data generated to depict the commune's socio-economic, topographical, demographic, financial and other characteristics or conditions. It is meant to complement available local information system data with primary information gathered directly from the commune. Patterned after the Philippines' community-based monitoring system (which was spun off from the Micro Impacts of Macroeconomic Adjustment Policies project in the Philippines), the CDIS will provide the commune with data that can serve as a basis for identifying local needs and problems, prioritizing them, and developing programs and projects that can improve the socioeconomic conditions of the community and alleviate the situation of its neediest members. The CDIS consists of three major phases, namely: (a) pre-implementation, which
involves the setting up of the CDIS system, determination of agencies involved, identification of the indicators needed, orientation of the commune council and community, organization of the CDIS working group, and preparation of the questionnaire form for the survey; (b) implementation; and (c) analysis and utilization of data. List of indicators. The data requirements identified for the CDIS framework are
basically those needed for local situation analysis/development planning, broadly classified into nine categories: (1) demography, (2) housing, (3) education, (4) health/nutrition, (5) peace and order/public safety, (6) income and livelihood, (7) land use, (8) agriculture, and (9) transportation. From these, 49 indicators meant to meet the minimum basic needs at the commune level have been identified. The indicators serve as basic information about the commune and the families therein, thus enabling action to be taken to address improvements in development and planning at the commune level. Implications and conclusion. With the key role that the commune councils play in the current Cambodian decentralization thrust, the development of a CDIS becomes all the more important. It is, after all, proposed to be a sangkat-based information system for gathering,
analysing and utilizing data regarding the basic needs of local residents. Moreover, it is a system in which certain members of the commune council and the community themselves are proposed to be members of the CDIS team, which will gather and handle the raw data and take charge of data updating on a regular basis. The data are also to be processed, analysed and maintained at the commune level, with copies submitted to the district and provincial planning and statistics levels for integration into their development plans. It thus provides a functional organization at the commune level that becomes part of the overall information system and at the same time generates information that enables the community and other agencies involved to take immediate local action. Institutionalizing the CDIS would require the participation of all levels of government. The bottom-up approach has greater chances of succeeding, for instance, if agencies like the NIS and the MOP take part at the very outset of the CDIS development, because these are the two agencies that will eventually take over the role of the SEILA program once the latter's mandate is completed. These two agencies can thus develop the CDIS to become the national standard database system for all agencies involved in development planning at the commune, district, province/municipality and national levels as a whole.
3.2.2. Access and Quality of Basic Education in Myanmar (Khin Khin Moe, Myanmar)
Good quality basic education is fundamental to acquiring relevant knowledge, life skills and an understanding of one's environment. The good personal qualities acquired would have a significant impact on individual productivity, quality of life and economic competitiveness. As such, basic education may be considered the foundation for national development and economic growth.
Recognizing this, the Myanmar government has set basic education as the center and focus of its Education For All (EFA) program. Its EFA National Action Plan is aimed toward the improvement, especially in terms of access, quality, relevance and management, of its primary and lower secondary levels, which are the heart of basic education.
Myanmar's overall education goal, which is embodied in the EFA program, is to create a system that can generate a learning society capable of facing the challenges of the knowledge age. In particular, the EFA program aims to: (1) ensure that significant progress is achieved so that all school-aged children have access to complete, free and compulsory basic education of good quality by 2015; (2) improve all aspects relating to the quality of basic education, such as teachers, personnel and curriculum; and (3) achieve significant improvement in the levels of functional literacy and continuing education for all by 2015. Are these goals being achieved? To answer this, it is important to see whether access to and the quality of basic education in Myanmar are improving, and to determine the trends of certain educational indicators through the years. Objectives. This paper therefore aims to respond to this concern by assessing the status of basic education in Myanmar and by examining the trends of 19 education indicators from 1991 to 2001. It likewise endeavors to show which education indicators would increase with an increase in government expenditure on education, and thereupon presents projections for them. Definition, data and methodology. Basic education is defined in this study as
composed of the primary and secondary school levels. In Myanmar, the basic education system is referred to as 5-4-2, meaning that it consists of 5 years of primary school, 4 years of middle school (lower secondary level) and 2 years of high school (upper secondary level). The entry age for the school system is 5 years. All schools and universities in Myanmar are run by the government, although monasteries also offer primary and middle education with the same curriculum. Nineteen (19) education indicators are used in the assessment, namely: (1) public expenditure by social sector, (2) public expenditure on education as a percentage of GDP, (3) public expenditure on education as a percentage of total government expenditure, (4) total number of villages and percentage of villages with schools, (5) number of schools in basic education, (6) pupil-teacher ratio, (7) ratio of students to schools, (8) gross enrolment ratio, (9) net enrolment ratio, (10) percentage of female students, (11) number of students in monastery education, (12) transition rate, (13) retention rate, (14) repetition rate, (15) promotion rate, (16)
dropout rate, (17) internal efficiency of primary education, (18) mean years of schooling per person aged 5 and over, and (19) adult literacy rate. Data are taken from the Myanmar Education Research Bureau, the Department of Education Planning and Training, the Department of Population, the Planning Department, the General Administration Department, and the Department for the Promotion and Propagation of the Sasana. Descriptive analysis is used to study trends and patterns over time, while correlation analysis and regression modeling are used to look into the relationship between the education indicators and government expenditure and to determine how much the government will be spending on education in the near future. Results/findings. Based on the trends established for certain education indicators from 1991 to 2001, the study asserts that Myanmar has achieved the goals stated in its EFA action plan. More specifically, for goal one, where access to complete, free and compulsory basic education of good quality is being targeted for all school-aged children by 2015, the trends for all indicators, except for the repetition and dropout rates, are shown to be increasing, suggesting an improvement in the access to and quality of basic education for all school-aged children and the full attainment of the goal by 2015. The continuing increase in the trends for public expenditure on education as a percentage of both GDP and total government expenditure is especially noteworthy since it shows the priority given by the government to education. Ditto with the overall increases in the trends in the number of schools in basic education as well as in the percentage of villages with schools. The upward trend indicates the provision of more and better social services for the sector. The increases in the retention, promotion and internal efficiency rates, accompanied by decreases in the trends of the repetition and dropout rates, meanwhile, support the suggestion that the quality of basic education has been improving. For goal two, in terms of improving all aspects of the quality of basic education, the pupil-teacher ratio trend line as shown in the analysis indicates an improvement in the access of pupils to teachers, thereby supporting the observation that teachers, who play a key role in
bringing about national development, are available to educate children and adults and to shape their attitudes and outlooks. Unfortunately, however, there are no data to show the quality of the curriculum. Nonetheless, it should be noted that administrators and managers of the Department of Basic Education are said to conduct curriculum evaluations regularly and revise the curriculum accordingly to meet the changing needs of the country. And for goal three, the continuing increase in the trend line for the adult literacy rate is one indicator of how Myanmar tries to effect improvements in the levels of functional literacy and continuing education for all its citizens. Myanmar's efforts in this regard have, in fact, been recognized by UNESCO, which has twice awarded Myanmar international literacy prizes. Meanwhile, the correlation analysis of education indicators with education expenditures shows which indicators are expected to increase given an increase in government expenditure on education. As such, the results help the government identify which education indicators it needs to monitor closely in order to see whether or not it is meeting its various EFA goals. For instance, the analysis shows that, among others, the gross enrolment ratio and the net enrolment ratio have a high positive relationship with expenditure on education. This means that for every increase in expenditure on education, these two indicators are expected to register increases. And because access to education means ease of enrolment in the first grade of every school level, the extent of access may be measured by these indicators. Monitoring these indicators therefore enables the authorities to assess how the country is faring in its attainment of the EFA goal of providing all school-aged children with access to basic education. Finally, the results of the study's analysis indicate that, on the whole, Myanmar's basic education has been managed well. Moreover, the results of the projections suggest that it will indeed do Myanmar well to continue to spend or invest in education, as this will bring further improvements in the education indicators that will help the country attain its EFA goals by 2015.
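The kind of correlation and regression analysis summarized above can be sketched roughly as follows. The file name, column names and the assumed 20 percent increase in expenditure are purely illustrative and are not taken from the study.

```python
# Hedged sketch of a correlation/trend analysis of an education indicator against
# education expenditure; the data file and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

ed = pd.read_csv("education_indicators.csv")   # assumed columns: year, expenditure, net_enrolment

print(ed[["expenditure", "net_enrolment"]].corr())   # strength of the association

# Simple linear regression of net enrolment on expenditure
X = sm.add_constant(ed["expenditure"])
fit = sm.OLS(ed["net_enrolment"], X).fit()
print(fit.params)

# Projection under an assumed 20 percent increase over the latest expenditure level
new_x = np.array([[1.0, 1.2 * ed["expenditure"].iloc[-1]]])   # [constant, expenditure]
print(fit.predict(new_x))
```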
3.2.3. Determinants of Poverty in Nepal (Shib Nandan Prasad Shah, Nepal)
Nepal is considered to be one of the poorest countries in the world. With a per capita income estimated at US$200 in 1995, it is ranked, based on a 1999 World Bank report, as the ninth poorest country in the world, and the poorest outside of Africa. In terms of the human development index computed by the United Nations, it is also almost at the bottom, at number 143 out of 175 countries as per the 2003 global Human Development Report. The overall living standard of the Nepalese people has thus remained very poor, with the majority of the population residing in rural areas where a big segment is poor. To help improve living standards, most of the government's economic development plans and programs have focused on agricultural and physical infrastructure development. In order to determine, however, whether or not such plans and programs are succeeding and meeting the United Nations Millennium Development Goal (MDG) of reducing extreme poverty and hunger by 2015, it is important to have solid and reliable information on the causes of poverty, where the poor are located and what their livelihood means are. Such information is helpful in monitoring progress in poverty reduction efforts and in designing better-targeted interventions directed towards reducing poverty. Objectives and rationale. In view of the above, this study aims to identify the
determinants of poverty in Nepal. In particular, it aims to present a basic poverty profile of Nepal, examine the relationship of household and household head characteristics to poverty in Nepal, identify the correlates of poverty and assess the extent of their effect on the probability of being poor in Nepal, and run poverty simulations. Poverty alleviation has always been the overriding thrust of Nepal's development efforts. Yet, despite noticeable progress, with the poverty rate declining from 42 to 38 percent over the past decade, widespread poverty still remains. In this regard, the government has made a renewed commitment to bring down the poverty level in Nepal from the baseline figure of 38 percent at the beginning of the plan period (2002) to 30 percent by the end of the plan period in 2007. Such a goal is a daunting task in itself because the problem of poverty has persisted for decades. Being a deeply rooted and complex phenomenon, poverty cannot easily be
eradicated. There are no quick and easy solutions unless one is able to identify the factors that are associated with being poor, the so-called determinants of poverty in Nepal. Description, data and methodology. Nepal is nestled between two populous countries: to its east, south and west is India, while to its north is China. It is a landlocked country that sits in the lap of the Himalayas. Geographically, Nepal is divided into 3 regions, namely, (a) Mountain, (b) Hill, and (c) Terai. Seven percent (7%) of the population live in the Mountain area, while 44 and 49 percent live in the Hill and Terai areas, respectively. There are 5 development regions and 75
administrative districts, the latter being further divided into smaller units called village development committees (VDCs) and municipalities. Poverty varies between urban and rural areas and across geographical regions, with rural poverty at 44 percent, almost twice as high as urban poverty (23 percent), and with poverty more pronounced in the Mountain areas (56 percent) than in either the Hill (41 percent) or Terai (42 percent) areas. The basic data for this study are taken from the Nepal Living Standard Survey (NLSS) of 1995-96, which consists of a sample of 3,388 households. The sample is distributed among the Mountain area (424 households), Urban Hills (604 households), Rural Hills (1,136 households) and Terai area (1,224 households). For its framework, the study uses various household and household head characteristics covering demography, education, economic status, employment, food consumption, household property and social facilities as explanatory variables to determine the correlates of poverty and assess how they influence the probability of a household's being poor. At the same time, the study examines how the probability of being poor changes as the household head characteristics change. The study develops a poverty profile of Nepal with the use of the three main indices of poverty, namely, the headcount index, poverty gap index and poverty severity index. Based on the profile, some household and household head characteristics are initially hypothesized as factors affecting a household's per capita consumption. With the help of a multiple regression model, the correlates of poverty are then identified. Going beyond this, the study then carries out a
poverty simulation exercise to predict the reductions or increases in general poverty levels resulting from unit changes in selected aggregate household or community characteristics. Certain caveats, though, have to be taken into consideration. For one, there appears to be some degree of measurement error in some variables in the NLSS data, which compromises the scope and quality of the analysis in this study. Also, some other potential determinants of poverty are not included in the data collected, and information on some variables like ethnicity, religion and household head is not complete for all households. Results/findings. The profile drawn by the study of Nepal's poor yields, among others, the following characteristics. Geographically, the poor are shown to reside mostly in the rural areas, with the poverty rate there (44 percent) being double that in the urban areas. Poverty is also more severe in rural than in urban areas. Across regions, the mid-western and far-western regions appear to have higher poverty incidence as well as more intense and severe poverty than other regions. In terms of demographic characteristics, all 3 poverty indices (headcount index, poverty gap index and poverty severity index) are shown to be higher as household size increases. Thus, it may be surmised that households with a large household size are more likely to be poor than those with a smaller size. As to ethnicity and religion, the analysis indicates that the occupational caste groups such as Kami, Magar and Damai are poorer than other castes, and that Muslims are more likely to be poor than those of other religions; poverty gap and severity, however, are higher among Buddhists. With regard to household head characteristics, education, gender, marital and employment status are shown to have significant effects on poverty status, with literate and female heads, for instance, being able to manage their incomes better and therefore having less chance of being poor. For social amenities/facilities, on the other hand, households with no access to safe water, no electricity and no toilets are counted among the poor. Meanwhile, the results of the modeling and simulation exercises seem to strengthen the picture drawn from the profile. At the same time, they indicate the variables with the highest impact on the program of poverty alleviation. The major conclusions indicate that the
following are strong determinants of living standards, and that their impacts, assessed on the basis of an additional Nepalese rupee invested in each, lead to large improvements in living standards and a reduction of poverty in the country: 1) a decrease in demographic factors such as household size and the number of dependent members in a household; 2) an increase in education variables such as adult female literacy, the maximum level of education attained by an adult, the literacy rate and the average number of years of schooling of a parent; 3) an increase in employment opportunities for household members and more self-employment in and outside of agriculture; 4) the acquisition of household properties such as farmland with modern technology and one's own dwelling; and 5) an increase in access to social and physical infrastructure such as schools, markets, banks, health posts, roads and the like, as well as access to safe drinking water.
Recommendations.
Government efforts must therefore be geared toward lowering dependency within households, reducing household size, improving the literacy of adult females, providing more opportunities for employment to the working-age population, reducing the mean time needed to access social and general infrastructure equally in all regions and in the rural sector, and avoiding gender discrimination in education. These efforts all have large impacts in improving the Nepalese standard of living. In sum, one of the most important determinants of poverty, as shown by the results, is education or the lack thereof. More education, particularly at the higher or tertiary level, enlarges the likelihood of people finding work or employment opportunities and thereby earning more money. As such, government must invest more in education programs. And while investments in education inherently have a long gestation, the simulations nonetheless show that they can be a powerful instrument in the long-term fight against poverty.
The same holds for the provision of more employment opportunities: with more employment comes more income and, in turn, more consumption. To reduce household size and the number of dependent household members, meanwhile, the government must promote national awareness programs on reducing fertility and provide more employment. Finally, because the model used in the study is static, the simulations give no indication of the time frame over which changes and improvements will take place, and the effects might only be felt after a long gestation period. Nonetheless, the analysis provides Nepalese policy planners with objective measures of the potential poverty-reduction impacts that might be realized from certain key sectoral strategies. Planners should therefore view the results of this study as a possible guide in allocating resources for poverty reduction based on a more informed analysis.
Various univariate and multivariate statistical methods (cf. Tables 4-1 and 4-2) can be used to answer specific questions, such as: Can the patterns observed in sample data be generalized to the entire population from which the data were collected? Is a (dependent) variable Y linearly related to other (independent) variables? Can the behavior of the variable Y be (statistically) significantly explained by the independent variables?
Table 4-1. Commonly-used Univariate Statistical Methods

Statistical Method/Model | Dependent Variable | Independent Variable
T-test | Quantitative (interval) | None (one population)
Median test | Quantitative (interval or ordinal) | None (one population)
Classical regression | Quantitative | All quantitative
Classical regression with dummy variables to represent qualitative variables | Quantitative | Some quantitative, some qualitative (emphasis on quantitative)
Classical regression, regression with autocorrelated errors | Quantitative (cointegrated data) | All quantitative
Classical regression and regression with autocorrelated errors, with dummy variables to represent qualitative variables | Quantitative | Some quantitative, some qualitative
Two-sample t-test | Quantitative (interval) | Qualitative (2 levels)
Wilcoxon-Mann-Whitney test | Quantitative (interval or ordinal) | Qualitative (2 levels)
Analysis of Covariance (ANCOVA) | Quantitative | Some quantitative, some qualitative (emphasis on qualitative)
Analysis of Variance (ANOVA) | Quantitative | All qualitative
Logistic regression (an alternative is discriminant analysis) | Qualitative | Some quantitative, some qualitative
Loglinear models (the chi-square test of independence is a special case) | Qualitative | All qualitative

Table 4-2. Commonly-used Multivariate Statistical Methods

Statistical Method/Model | Dependent Variable(s) | Independent Variable
Multivariate regression | Two or more (all quantitative) | All quantitative
Multivariate regression with dummy variables to represent qualitative variables | Two or more (all quantitative) | Some quantitative, some qualitative (emphasis on quantitative)
Multivariate Analysis of Covariance (MANCOVA) | Two or more (all quantitative) | Some quantitative, some qualitative (emphasis on qualitative)
Multivariate Analysis of Variance (MANOVA) | Two or more (all quantitative) | All qualitative
Generalized repeated measures model | Quantitative variable measured at different points in time | Some quantitative; some qualitative
Some statistical methods of analysis do not necessarily answer questions directly, but may be used to reduce a dataset into fewer variables (including principal components analysis
and factor analysis), to group observations (including cluster analysis and discriminant analysis), or to perform correlation analysis between groups of variables (including canonical correlation analysis). Other statistical tools are used for specific purposes or types of data, including survival analysis models (for analysis of failure or lifetime data), artificial neural networks and time series models (for forecasting), and geostatistical models (for analyzing spatial data). All the statistical methods of analysis to be discussed in this chapter help in answering research questions and in transforming raw data into meaningful information. Various types of data will require various types of analyses, even when data are merely summarized. For instance, simply averaging data of a categorical nature may not be meaningful. Cross-section regressions, time series models and panel data analyses need to undergo validation to ensure that the models will provide useful insights into, and forecasts of, a phenomenon. In this day and age, statistical analysis can readily be performed with the aid of computer packages. Some basic statistical methods, including correlation analysis and simple linear regression, can even be performed on electronic spreadsheets, such as Microsoft Excel, which were developed as interactive calculators for organizing information in columns and rows (see Figure 4-2). Be aware, though, that Excel's statistical features have limitations.
We can also employ a number of statistical software packages, such as STATA (http://www.stata.com), SAS (http://www.sas.com), SPSS (http://www.spss.com), and R
(http://cran.r-project.org), that provide pre-programmed analytical and data management capabilities. There are also a number of statistical software packages that can be used for special purposes. The US Census Bureau's two data processing software packages, IMPS and CS-PRO, can respectively be downloaded from http://www.census.gov/ipc/www/imps/ and
http://www.census.gov/ipc/www/cspro/index.html, while the variance estimation software CENVAR can be obtained from http://www.census.gov/ipc/www/imps/cv.htm. The current competitor to CS-PRO is EPI-INFO (http://www.cdc.gov/epiinfo/), which was developed by the US Centers for Disease Control. Other special-purpose software includes the free seasonal-adjustment software X12 (http://www.census.gov/srd/www/x12a/) from the US Census Bureau, TRAMO-SEATS (http://www.bde.es/servicio/software/econome.htm) from the Bank of Spain, the software WINBUGS for Bayesian analysis with the Gibbs sampler (http://www.mrcbsu.cam.ac.uk/bugs/winbugs/contents.shtml), and the econometric modeling software Eviews (http://www.eviews.com/eviews4/eviews4/eviews4.html). As Table 4-3 illustrates, statistical software may also be categorized by cost. Commercial software typically requires a license that is paid for either yearly or as a one-time cost. These costs currently range from about 500 to 5,000 US dollars. Freeware, on the other hand, can be readily downloaded over the web without cost, but usually offers less technical support and training.
Table 4-3. Commonly-used Statistical Software, by Purpose and Cost

Purpose | Commercial | Freeware
General | SAS, SPSS, STATA | R
Special | Eviews (Econometric Modeling) | WINBUGS (Bayesian Modeling); X12 (Seasonal Adjustment); IMPS and CENVAR (Survey Data Processing and Variance Estimation); Epi Info
The popular commercial statistical software include SAS, SPSS, STATA, and Eviews, while Epi Info, R, IMPS, CENVAR, TRAMO-SEATS, X12 and WINBUGS are free statistical software. Commercial software may be advantageous to use over freeware because of support and training.
Training may be in the form of formal face-to-face courses (which can be expensive), web-based and/or distance learning courses, or documentation that is comprehensive enough to use as a training manual. Typically, documentation is only a reference manual, and the software user must learn how to use the package from a formal training course or through someone familiar with the software. In this training manual, we focus on the use of the commercial statistical software STATA (pronounced "stay-tah"), version 10, since, among the commercial statistical software identified, it is considered rather cost-effective and is designed for research, especially on data generated from complex survey designs (as is the case in many NSOs).
Figure 4-3. Four main STATA windows: Review window (upper left-hand corner), Variables window (lower left-hand corner), Results window (upper middle), Command window (lower middle)
A variable name in the Variables window can be pasted into the Command window or an active dialog field simply by clicking on the variable name. Also, if you click a command in the Review window, it gets pasted into the Command window, where you can edit and execute the edited command. You may also scroll through past commands in the Review
window using the PgUp and PgDn keys. Past commands in the Review window may also be saved into a do-file by clicking the upper-left Review window button and selecting Save Review Contents. These windows may be resized and moved about the screen by left-clicking and dragging with the mouse. However, note that it is useful to have the Results window be the largest, in order to see a lot of information about your STATA commands and output on the screen. If you are creating a log file (see below for more details), its contents can also be displayed on the screen; this is sometimes useful if one needs to back up to see earlier results from the current session. The fonts or font size may be changed in each window by clicking the upper-left window button (or right-clicking on the window) and then choosing Font in the resulting pop-up window; finally, select the desired font for that type of window, or select a fixed-width font, e.g., Courier New 12 pt. When finished, make your choices the default by clicking on Prefs > Save Windowing Preferences. If the settings were lost, you can easily reconstruct them by clicking on Prefs > Load Windowing Preferences. By default the Results window displays a lot of colors, but you can readily choose a set of predefined colors or customize the colors; henceforth, we use the white-background color scheme for the Results window. In addition to these four major windows, the STATA computing environment has a menu and a toolbar at the top (to perform STATA operations) and a directory status bar at the bottom (which shows the current directory). The menu and toolbar can be used to issue different STATA commands (like opening and saving data files), although most of the time it is more convenient to use the STATA Command window to perform those tasks. Starting with version 8, STATA has a graphical user interface (GUI) that allows users to directly perform data management (with the Data menu), obtain some data analyses (with the Statistics menu) or generate all sorts of graphs on a data set (with the Graphics menu) with point-and-click features. In addition, note that STATA opens a Viewer window when online help is requested from the menu bar.
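As an illustration of what such saved commands might look like, here is a minimal sketch of a do-file; the file name explorehh.do and the folder used are only assumptions made for this example:

* explorehh.do -- a small batch of commands, e.g., saved from the Review window
cd c:\intropov\data
use hh, clear
describe
summarize

A do-file like this can then be run in one pass by typing do explorehh.do in the Command window.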
STATA is actually very easy and intuitive to use once you get used to its jargon and formats. The syntax of the commands can be readily learned, perhaps more easily than other statistical packages' syntax. STATA has both a command-driven interface and a graphical user interface (GUI), i.e., point-and-click features. With the command-driven interface, actions, e.g., summary tables and graphs, are set off by commands, which can be issued either in interactive or in batch mode. In interactive mode, users issue commands line by line and obtain results after each command line is issued. Commands can also be retrieved with the Page-Up and Page-Down keys or from the list of previous commands. In batch mode, users can run a set of commands and have the results all saved into a file (or shown on screen). Documentation of the data management and analysis is easy to do, in contrast to working with spreadsheets (where we can do many things, but where documenting what we did, including the sequence of actions, may be difficult). A number of econometric routines are available in STATA. The range of statistical methods and tools provided in the software allows us to perform descriptive and/or exploratory research, to discover and explain relationships, to make comparisons, to generalize and forecast, and to confirm or possibly question hypotheses or existing theories. The statistical methods that can be run in STATA include classical and logistic regression, simultaneous equation methods, nonparametric curve fitting, statistical models for analyzing ordinal, count, binary and categorical outcomes, a number of multivariate routines such as cluster analysis and factor analysis, and various econometric methods for analyzing time series data, panel data and multivariate time series, including ARIMA, ARCH/GARCH, and VAR models. STATA also allows users to add their own procedures. New commands and procedures are regularly distributed through the net and/or the Stata Journal. STATA will thus perform virtually all of the usual statistical procedures found in other comprehensive statistical software, and its strengths lie in the commands for survival analysis, panel data analysis, and analysis of survey data. STATA readily accommodates survey data with designs more complex than SRS, such as stratified and cluster sampling designs. Unlike the survey software SUDAAN (http://www.rti.org/patents/sudaan.html), STATA now is able to handle several stages of sampling.
It provides a family of commands (called svy commands) for various analyses of survey data, including svy:mean, svy:total, svy:ratio, and svy:prop for estimating means, totals, ratios, and proportions. Point estimates, associated standard errors, confidence intervals, and design effects for the full population or for subpopulations can be generated with these commands. Statistical methods for estimating population parameters and their associated variances are based on assumptions about the characteristics and underlying distribution of a dataset. Statistical methods in most general-purpose statistical software assume that the data meet certain assumptions, especially that the observations are generated through a simple random sample (SRS) design. Data collected through sample surveys, however, often have sampling schemes that deviate from these assumptions. Not accounting for the impact of the complex sample design can lead to a serious underestimate of the sampling variance associated with an estimate of a parameter. The primary method used for variance estimation for survey estimates in STATA is the Taylor-series linearization method. There are, however, also STATA commands for jackknife and bootstrap variance estimation, although these are not specifically oriented to survey data. Note that several software packages for analyzing survey data, such as SUDAAN and PC Carp (http://www.statlib.iastate.edu/survey/software/pccarp.html), use the Taylor series approach to variance estimation. Replication approaches to variance estimation are used in WesVar and WesVarPC (http://www.westat.com/wesvarpc/index.html) and VPLX (http://www.census.gov/sdms/www/vwelcome.html). Two popular general-purpose statistical packages, viz., SAS and SPSS, have recently developed the capacity to analyze complex sample survey data. There is an ongoing debate as to whether the sample design must be considered when deriving statistical models, e.g., linear regression, logistic regression, probit regression, tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson regression, based on sample survey data. Analysts interested in using statistical
techniques such as linear regression, logistic regression, or survival analysis on survey data are divided as to whether they feel it is necessary to use specialized software. A compromise position adopted in STATA is to incorporate into the model the variables that were used to
define the strata, the PSUs and the weights. The STATA commands svy:reg, svy:logit, and svy:probit are available for the regression, logistic regression, and probit analysis procedures. The command svydes allows the user to describe the specific sample design and should be used prior to any of the above commands. Note that although STATA can calculate
standard errors based on bootstrap replication with its bstrap command, its bootstrapping procedure assumes the original sample was selected using simple random sampling. Therefore the bstrap command is not appropriate for complex survey data. Although STATA's strengths have traditionally been centered on cross-section and panel data estimation techniques (in the family of xt commands such as xtdes and xtreg), there is now a growing set of sophisticated methods for analyzing time-series data, including smoothing techniques; Auto-Regressive Integrated Moving Average (ARIMA) models, including ARMAX models; a number of auto-regressive conditionally heteroskedastic (ARCH) models; and vector autoregressions (VARs) and structural VARs, impulse response functions (IRFs), and a suite of graphics capabilities for the presentation of IRFs. Both user-specified nonlinear least squares and maximum likelihood estimation capabilities are provided in STATA, although there is currently no support for nonlinear systems estimation (e.g., FIML). Whether for time-series or panel data analysis, STATA allows time-series operators, including L. for lag, F. for lead, and D. for difference. They may be combined, and may take a numeric argument or range: for example, L2D3.y would refer to the second lag of the third difference of the variable y. Aside from ARIMA and ARCH estimation, the prais command is available for estimating regressions with AR(1) errors via the Prais-Winsten or Cochrane-Orcutt methods, while the newey command estimates regression models with Newey-West (HAC) standard errors. The STATA arima command is actually capable of estimating a variety of models beyond the standard Box-Jenkins approach, such as a regression model with ARMA disturbances. You may also specify a particular optimization method. The arch command likewise goes beyond ARCH, and actually includes GARCH, ARCH-in-mean, Nelson's EGARCH, threshold ARCH, and several forms of nonlinear ARCH. Diagnostic tools available for univariate time series include a number of unit root tests, several frequency domain measures
and tests for white noise and ARCH effects, and the Durbin-Watson test. Smoothers available in STATA include moving average, exponential, double exponential and Holt-Winters smoothers (both seasonal and nonseasonal). Nonlinear filters are also provided. Multivariate time-series techniques in STATA pertain to vector autoregressions, including standard VAR estimation, structural VAR estimation, and the generation of diagnostic tests (such as Granger causality and lag-order selection statistics), dynamic forecasts, forecast-error variance decompositions and impulse response functions in point and interval form. STATA allows users to write their own add-ons and code, and to reuse and share this code and these extensions with other users. STATA thus provides for a lively exchange of ideas and experiences among its users, who are largely from academic, research and government institutions as well as international organizations, while other statistical software such as SPSS increasingly targets the business world. One big advantage of STATA over other statistical software is that it can also incorporate a sample survey design into the estimation process. Also, it provides a very powerful set of facilities for handling panel/longitudinal data, survival analysis data and time series data. STATA is offered in four versions: Small, Intercooled Stata, Special Edition (SE), and Multi-processing (MP). Small is a student version, limited in the number of variables (99) and observations (1,000), but otherwise complete in functionality. The Intercooled version is the standard version. It supports up to 2,047 variables in a data set, with the number of observations limited by available RAM (technically, as large as 2.147 billion), as the entire data set is held in memory. Intercooled allows matrices of up to 800 rows or columns. The SE version arose during the life of release 7.0 in response to users' needs for analyzing much larger data sets. Thus, SE allows significantly more variables in a data set (32,767) than Intercooled or Small, and supports larger matrices (up to 11,000 rows or columns). Advanced multiprocessing may be done through the STATA MP version. People often wonder how Stata compares with other statistical packages. Such a comparison actually depends on the functions and version of the software. For a comparison of Stata release 9 with SAS 9.13 and SPSS 14, see, e.g., the following link:
http://www.ats.ucla.edu/stat/technicalreports/number1_editedFeb_2_2007/ucla_ATSstat_tr1_1.1_0207.pdf. In Windows 95, STATA can be run on any 386 or higher PC with at least 8 MB of RAM and a math co-processor. STATA actually has cross-platform compatibility: the software can run efficiently on Windows (all current versions), Power Macintosh (OS 8.6, 9.X, or OS X), Alpha AXP running Digital Unix, HP-9000 with HP-UX, Intel Pentium with Linux, RS/6000 running AIX, SGI running Irix 6.5, and SPARC running Solaris. A dataset, graph, or add-on program created using STATA on one operating system, e.g., Windows, may be read without translation by STATA running on a different platform, such as MacOS. If you change your platform, all your STATA data, commands and graphs will work on your new platform; there is no need for data set translation. Stata Corporation also resells a third-party program, Stat/Transfer (a product of Circle Systems, http://www.circlesys.com), which supports interchange between several versions of STATA and the binary formats of many other statistical packages and matrix languages. The STATA software includes comprehensive graphics procedures designed to summarize results of the analyses and to assist in diagnostic checking in the use of standard statistical models. Results, including graphics, can be readily copied across to other programs, such as word processors, spreadsheets and presentation software. STATA also provides a graphics subsystem which generates publication-quality graphs. Graphics are highly customizable, and plots of various sorts may be superimposed and juxtaposed. Graphics files are produced in a native format (with the extension name .gph) and may be translated within the program into Portable Network Graphics, PostScript, Encapsulated PostScript, Windows Metafile, Windows Enhanced Metafile, and Portable Document Format, depending on the platform. At present, STATA graphics commands support only two-dimensional plots and do not include contour plots, but the new graphics language enables the development of three-dimensional plots, surface graphs, and the like, and STATA users might expect those capabilities to be forthcoming in future versions or even from enterprising STATA users. In addition, Stata 10 now has the added feature of enabling graphs to be edited (similar to capabilities found in other competitors, such as SPSS for Windows Interactive Graphics).
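For instance, assuming a graph is currently displayed, it can be kept in the native format and exported to one of the formats just listed with the graph save and graph export commands; the file name distplot is only illustrative:

* keep a native-format (.gph) copy, then export a copy for use in other programs
graph save distplot, replace
graph export distplot.png, replace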
The simplicity in the use of STATA, coupled with its high-end programming capabilities, makes the STATA software an excellent tool for research in official statistics. Although some modules, e.g., multivariate analysis, still need considerable improvement while others, e.g., time series analysis, are still under development, STATA meets a very sizable percentage of needs for quantitative research. The addition of publication-quality graphics makes it possible to depend on STATA rather than having to export data to other graphics production packages, and the introduction of a GUI, i.e., a menu-driven interface, which started in version 8, makes STATA accessible to novice researchers with no prior computing experience. Although no single statistical package can serve all needs, STATA's developers clearly are responding to many research needs. Coupled with STATA's cost-effectiveness, this package will undoubtedly become more and more valuable and useful to researchers in official statistics. Data can be comprehensively managed with STATA's versatile commands. Various data files can be read into STATA, including worksheets, ASCII files and STATA datasets. The latter have the extension name .dta, i.e., a STATA data file called hh has the complete file name hh.dta. By default, files are read by STATA from the folder c:\data (see the directory status bar in Figure 4-3). If you wish to open a pre-existing STATA dataset in a different folder, you have to tell STATA to change directory. If you wish to read a file called hh found in the directory c:\intropov\data, you merely have to enter the following commands in the STATA Command window:
cd c:\intropov\data
use hh
The cd command tells STATA to change the working directory, which by default is c:\data, to c:\intropov\data, while the use command tells STATA to read (and employ) the specified file. These two commands yield the Results window shown in Figure 4-4. Notice that the first and third lines repeat the commands you entered; the second line reports a successful change of working directory, while the fourth line implies that the use command has been
executed successfully.
Figure 4-4. Results Window after changing working directory and reading pre-existing STATA datafile smallfies.
Instead of first issuing a cd command followed by a use command, you could also read the file hh in the folder c:\intropov\data by simply entering the full path in the use command, as in:
use c:\intropov\data\hh
Alternatively, you could read a file by selecting File > Open in the Menu bar, or by clicking on the leftmost icon, an opening file folder, in the Tool bar. Note that the file hh.dta and other Stata data files are available from http://mail.beaconhill.org/~j_haughton/povertymanual.html as part of a set of computing exercises for poverty analysis. The data are a subsample of the microdata from the Household Survey 1998-99 that was conducted jointly by the Bangladesh Institute of Development Studies (BIDS) and the World Bank. If a file to be read were too large for the default memory capacity of STATA, then an error message (as in the Results Window of Figure 4-5) would be seen.
Figure 4-5. Resulting error message from reading a large STATA datafile.
To fix such a problem, you have to change the memory allocated by issuing the command:
set mem 30m
and then reissue the use command. If the file opens successfully, then the allocated memory is sufficient. If you continue to obtain an error message, you can try 40m or 60m. But be careful not to specify too much memory for STATA; otherwise, the computer will use virtual memory, which will actually slow down your computer. All STATA error messages are short messages that include a return code. The code is a link, and by clicking on it you get more clues for fixing the error. Some error messages, however, are not very informative. Note also that the memory allocation command works only if no data set is open. To clear the memory in STATA, either enter
clear
or issue the command
drop _all
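Putting these steps together, a typical sequence for opening a data set that does not fit in the default memory might look as follows (the 50m figure is only an example):

* clear any data in memory, enlarge the memory allocation, then read the file
clear
set mem 50m
use c:\intropov\data\hh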
In the use command, if your computer is connected to the internet you can also specify a web address, e.g., use http://courses.washington.edu/b517/data/ozone.dta
In STATA, you can only have one data set open at a time. Suppose that the ozone dataset is open; if you open the hh dataset, you will be replacing the ozone file in memory with this file. You can do this by typing use hh, clear in the command line, or in sequence clear use hh
which first clears the active memory of data and then reads the new data set. You can use the count command to tally the number of observations in the data set:
count
The count command can also be used with if-conditions. For the data set hh.dta, you can issue the following command to give the number of households whose head is older than 50.
count if agehead > 50
If you want to see a brief description of the dataset in active memory, you can enter the describe command:
describe
from the command window, or alternatively, you can select Data > Describe Data > Describe Variables in Memory in the Menu Bar and then click OK after receiving the pop-up window displayed in Figure 4-6.
Either way, we obtain the results shown in Figure 4-7 below, which describe the contents of the STATA data file hh.
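To inspect a few individual records as well, the list command can be restricted to a range of observations; the particular variables and range chosen here are purely illustrative:

list famsize agehead in 1/5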
Summary statistics for the variables in memory can be obtained with the summarize command:
summarize
from the command window, or alternatively, you can select Data > Describe Data > Summary statistics and then click OK after receiving a pop-up window. If you wish to obtain a few more summary statistics, such as skewness, kurtosis, and the four smallest and four largest values, along with various percentiles, you will need to specify the detail option:
summarize, detail
If you would like to generate a table of frequencies by some subpopulation, you will have to use the tab(ulate) command. For instance, for the STATA data set hh.dta, if we wish to get the frequencies of sampled households across the regions, we enter
tab region
from the command window, or alternatively, you can select Statistics > Summaries, tables & tests > Tables > One-way tables and use the drop-down to select region (or type the word) in the space for the categorical variable. If, in addition, you wish to obtain summary statistics for one variable by region, such as the variable distance, then you use the summarize option. That is, you issue the command and obtain:
tab region, sum(distance)
Options are specific to a command; a comma precedes the option list. Note that omitting or misplacing a comma is a frequent cause of error messages. Another convenient command is the table command, which combines features of the sum and tab commands and the by option. In addition, it displays the results in a more presentable form. If you want to obtain the mean distance to the nearest paved road and the mean distance from the dwelling to a bank by region (and across the database), then you issue the command:
table region, c(mean distance mean d_bank) row
(output deleted)
See the online help for details on the various formatting options available. Note that formatting changes only the display, not the internal representation of the variable in memory. Note also that the table command can display up to five statistics, not just the mean. The examples shown above pertain to one-way tables. However, it is possible to display two-way, three-way or even higher-dimensional tables in STATA. The tab command may be used to generate contingency tables. For instance, if you want to obtain a cross-tabulation of households with a family size greater than 3, by region and by sex of the household head, you issue the command and generate the result given below:
tab region sexhead if famsize > 3
To see percentages by row and by column, we can add options:
tab region sexhead if famsize > 3, col row
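As an illustration of the table command's ability to display several statistics at once, one might request, say, the mean, standard deviation and number of observations of distance by region; the particular statistics chosen here are arbitrary:

table region, c(mean distance sd distance n distance) row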
The sort command, for example
sort vill
sorts the dataset by village. You could also put an if-qualifier in all these commands. You can also create new variables in STATA. If you want to compute means (or other statistics) by groups that you construct, for example the mean household total assets across four age groups of the household head (20 and below, 21-40, 41-60, and over 60), then you first need to construct these age groups with the generate and replace commands, sort by the constructed variable, and then compute the means:
gen agegp=1 if agehead<=20
replace agegp=2 if agehead>20 & agehead<=40
replace agegp=3 if agehead>40 & agehead<=60
replace agegp=4 if agehead>60
label var agegp "age group"
sort agegp
by agegp: sum hassetg
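An alternative way of constructing essentially the same age grouping, offered here only as a sketch (the new variable name agegp2 is illustrative, and ages are assumed to be recorded as whole numbers), uses the egen command with its cut() function:

* groups [0,21), [21,41), [41,61), [61,200), coded 0-3 by the icodes option
egen agegp2 = cut(agehead), at(0,21,41,61,200) icodes
label var agegp2 "age group (egen version)"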
To employ the by prefix (as in by agegp: sum hassetg above), the data must first be sorted by the grouping variable. The label command can be used not only to label variables but also to label the values of a variable:
label define gp 1 "20 & below" 2 "21-40" 3 "41-60" 4 "above 60"
label values agegp gp
After examining and making changes to a dataset, you may want to save those changes. You can do that by using the STATA save command. For example, the following command saves the changes to the hh.dta file:
save hh, replace
You can optionally omit the filename above (that is, save, replace is sufficient). If you do not use the replace option, STATA does not save the data but issues the following error message:
file hh.dta already exists
r(602);
The replace option tells STATA to overwrite the pre-existing original version with the new version. If you do NOT want to lose the original version, you have to specify a different filename in the save command, say, save the file with the name hhnew:
save hhnew
file hhnew.dta saved
Notice that there is no replace option here.
If you have an unsaved data set open and you try to exit from STATA, you will receive the following error message: no; data in memory would be lost r(4);
To deal with this problem, you have to first save the data file and then inform STATA that you want to exit. If you want to exit STATA without saving the data file, you have to instead clear the memory (using the clear command or the drop _all command) before informing STATA that you really want to exit. Alternatively, you can combine the two steps in one command:
exit, clear
As a researcher, it is important to document all your work for subsequent use, either by you or by another researcher. You can readily keep track of work in STATA by creating a log file, which lists all the commands entered as well as their output. Note, however, that graphical output is not saved in a log file; graphs have to be saved separately. More about graphs will be discussed in subsequent sections of this manual. You can use the Open Log button (the fourth button from the left on the toolbar) to
establish a log file. This opens a dialogue box that asks for a file name. The default extension is .smcl, which denotes a formatted log file, although you may opt to use an ordinary log file (which can be read and edited by any word processor or text editor, such as Notepad, WordPad, or Microsoft Word). Formatted logs, on the other hand, can only be read within the STATA software. You can give the log file a name such as log1, change the default extension to an ordinary .log file by clicking SAVE AS TYPE, and also change the default folder to an appropriate folder, such as c:\intropov. Alternatively, you may open a log by entering in the command window
log using c:\intropov\log1.log
Once a log file is created, all the commands and subsequent output (except graphical output) are logged into the log file until you ask STATA to stop writing to it. After running some STATA commands, you may decide to close or suspend the log by pressing the same button used to open a log; this will result in a
dialogue box asking you whether you wish to view a snapshot of the log file, close it, or suspend it. In future STATA sessions, you can decide to overwrite or append to an existing log file.
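Putting the logging commands together, a session might be bracketed as in the following sketch (the file name and folder are again only illustrative); log close ends the log, while log off and log on merely suspend and resume it:

log using c:\intropov\log1.log, replace
* ... the commands issued here, and their output, are recorded in the log ...
log close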
Every statistical analysis ought to involve graphs, which not only help suggest patterns but also allow close scrutiny of the data. You can readily generate a number of graphs in STATA that will illustrate the distribution of a variable or suggest relationships among variables. A very simple but useful pictorial representation of the data distribution is a box-and-whiskers plot, which displays a box that extends from the lower quartile to the upper quartile (with the median shown in the middle of the box). The lower quartile, median and upper quartile are the 25th, 50th and 75th percentiles of the distribution. If the values in a distribution are sorted from lowest to highest, the quartiles divide the distribution into four parts, with the lower quartile being the value below which 25% of the data fall, the median being the value below which half of the data fall (and above which half fall), and the upper quartile being the value below which 75% of the data fall. Whiskers are drawn to represent the smallest and largest data values within 1.5 IQR of the lower and upper quartiles, respectively, where the IQR (the so-called inter-quartile range) is the difference between the upper and lower quartiles. Points beyond the 1.5 IQR limits from the quartiles are considered outliers, i.e., extreme points in the distribution. A box-and-whiskers plot can be obtained in STATA with the graph box command. To generate a box (and whiskers) plot of the family size variable in the hh dataset, either enter in the command window
graph box famsize
or select from the Menu bar Graphics > Box plot and specify the family size variable famsize in the resulting dialogue box. Either way, this will generate the box-and-whiskers plot in Figure 4-8. The figure shows that the lower quartile of family size is 4, i.e., 25% of all the families sampled have family sizes of 4 or below. Also, 50% of all sampled families have family sizes of 5 or below, and 75% of all sampled families have family sizes of 6 or below. A family size distribution typically has extremes on the upper tail, as there are a few families (typically, poor families) with extremely large sizes. For this database from Bangladesh, we readily observe that the usual
family sizes range from 1 to 9, although some family sizes go as high as 10 to 17; these latter values are called outliers and are shown as circles in Figure 4-8.
You can generate a histogram of family size, displayed in Figure 4-9, by entering
hist famsize, discrete
in the command window. The discrete option is needed because, by default, STATA assumes the variable to be continuous.
Figure 4-9. Histogram of total family members
Alternatively, you can generate the histogram in Figure 4-9 by selecting from the Menu bar Graphics > Histogram, specifying the famsize variable in the resulting dialogue box and ticking the discrete option rather than the default (continuous) option. Here, we used a unit width for the intervals. Note that you can also specify the width of the intervals to be some value, say 2:
hist famsize, discrete width(2)
If the discrete option is not specified, you can suggest how many bins should be generated with the bin option, say, 15 bins:
hist famsize, bin(15)
You could also generate histograms disaggregated by some subpopulations, and use weights. For instance, the following command histogram agehead [fw=round(weight)], by(region, total) yields a histogram of the age of the household heads by region (and across the country). The age distributions are weighted by the variable weight. The shape of the histogram is typically influenced by the number of bins and the choice of where to start the bins. Thus, you may want to get a nonparametric density estimate of a
data distribution, i.e., a smoothed version of the histogram. One such nonparametric density estimate is the kernel density estimate. In STATA, you can generate Figure 4-10 with:
kdensity agehead [aw=weight]
Figure 4-10. Kernel density estimate of age of household head
In some cases, you may want to determine whether the data distribution can be approximated by a normal curve, so you may want to add the normal option:
kdensity agehead [aw=weight], normal
If you are still unconvinced that you can fit a normal curve through the age distribution, you may want to generate a probability plot, such as the quantile-normal plot
qnorm agehead
with the resulting plot interpreted to mean that the normal distribution fits well provided the points (representing the observed quantiles plotted against the expected normal quantiles) fall along the 45-degree line. The STATA command
graph bar (mean) distance d_bank, over(region)
produces the bar graph in Figure 4-11 of average distance to nearest paved road and average distance to bank by region.
Figure 4-11. Bar graph of average distance to nearest paved road and average distance to bank by region in hh
A scatterplot of the distance to the nearest paved road versus the distance from the dwelling to the nearest bank, on the other hand, can be obtained by entering
graph twoway scatter distance d_bank
or simply entering
scatter distance d_bank
in the command window. Alternatively, we can use the Menu bar to obtain this scatterplot by selecting Graphics > Twoway graphs, then choosing Create > Plot, Basic Plot (Scatter) in the resulting dialogue boxes, and identifying the x variable as d_bank and the y variable as distance in the resulting pop-up window. Either way, we generate the scatter diagram in Figure 4-12.
Figure 4-12. Scatterplot of Distance of nearest paved road versus Distance of nearest bank to dwelling in hh
You can also see the scatterplot with a fitted simple linear regression line (and its confidence band) by entering the command
twoway (lfitci distance d_bank) (scatter distance d_bank)
in the command window, or by choosing in the graphical user interface to create a second plot, a Fit plot (Linear prediction).
Note that the correlation between the logarithms of distance and d_bank is not the same as the correlation of distance and d_bank; neither is the resulting correlation the log of the correlation coefficient we calculated initially. Although there are actually no hard rules for judging the strength of a linear relationship from the correlation coefficient, we may use the following guide in interpreting the correlation:
0 < r < 0.3     weak correlation
0.3 < r < 0.7   moderate correlation
r > 0.7         strong correlation
For instance, the correlation coefficient of +0.36 between distance and d_bank indicates a moderate (positive) correlation between the variables. A few caveats should be stressed about interpreting the correlation coefficient. A correlation of 70% does not mean that 70% of the points are clustered around a line; nor does a set of points with a correlation of 70% have twice as much linear association as a set with a correlation of 35%. Furthermore, a correlation analysis does not imply that the variable X causes the variable Y. That is, association is not necessarily causation (although it may be indicative of cause-and-effect relationships). Even if polio incidence correlates strongly with soda consumption, this need not mean that soda consumption causes polio. If the population of ants increases (in time) with the population of persons, and thus these numbers strongly correlate, we cannot adopt a population control program for people based on controlling the number of ants! Also, while the direction of causation is sometimes obvious, e.g., it is rain that causes the rice to grow and not the growth of rice that causes the rain, the direction of causation may not always be clear: what is the relationship between macroeconomic growth and job creation? Does economic growth come first, yielding more sectors that create more jobs? Or does job creation come first? Often, both variables are driven by a third variable. The weight and height of a person are certainly strongly correlated, but does it make sense to claim that one causes the other? Finally, note that the presence of outliers easily affects the correlation of a set of data, so it is important to take the correlation figure with a grain of salt if we detect one or more outliers in the data. In some situations, we ought to remove these outliers from the data set and redo the correlation analysis; in other instances, these outliers ought not to be removed. In any scatterplot, there will usually be some points detached from the main bulk of the data, and
these seeming outliers need not be rejected without due cause. A basic guide to the treatment of outliers is given in www.tufts.edu/~gdallal/out.htm. A measure less sensitive to outliers is the rank correlation, i.e., the correlation of the ranks of the x-values with the ranks of the y-values. This may be computed instead of the usual (Pearson product-moment) correlation coefficient, especially if there are outliers in the data. This rank correlation is called Spearman's rho. The command
spearman distance d_bank
computes the Spearman rank correlation between distance to the nearest paved road and distance from the dwelling to a bank. There is another rank correlation, due to Kendall. To generate Kendall's tau in STATA for distance and d_bank, you enter:
ktau distance d_bank
Sir Francis Galton was the first to consider an investigation of correlations within the context of studying family resemblances, particularly the degree to which children resemble their parents. Galton's disciple Karl Pearson further worked on this topic through an extensive study on family resemblances. Part of this study involved collecting the heights of 1,078 fathers and of their respective first-born sons at maturity. A plot of these data is shown in Figure 4-13, with each dot representing a father's height paired with his son's height.
Typically, correlation analysis is undertaken to investigate the possibility of explaining causality. In the case of the Pearson study, the analysis led Pearson to establish that the height of the son can be predicted given the height of the father. If there is a weak association between the variables, then information about one variable will not help us in predicting the other variable. More on this will be explained in the context of regression analysis in a later section.
It may be important to mention that there exist spurious correlations involving time; in such cases, it is important to remove the time trends from the data before correlating them. Furthermore, correlation calculations are sensitive to sample selection: restricted sampling of individuals can have a dramatic effect on the correlation. Also, it may be misleading to compute correlations for averages, or for cases where the sample comprises different subgroups.
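In STATA, pairwise correlations, together with the number of observations used and p-values for the hypothesis of zero correlation, can be obtained with the pwcorr command, for example:

pwcorr distance d_bank, obs sig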
Many more graphs can be generated with STATA. For instance, we can obtain a pairwise scatterplot (also called a matrix plot), which can be helpful in correlation analysis and in determining whether it makes sense to include certain variables in a multiple regression model. You can generate a pairwise scatterplot with the graph matrix command. As was earlier pointed out, graphs generated in STATA can be saved into a STATA-readable graphics file with the extension name .gph. These graphs can also be saved in various graphical formats, including Windows Metafile (.wmf), Windows Enhanced Metafile (.emf), Portable Network Graphics (.png), PostScript (.ps), or Encapsulated PostScript (.eps). Either right-click on the graph and choose the desired graphical format, or click on File > Save Graph and respond to the resulting dialogue box.
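As an illustration of the graph matrix command mentioned above, a pairwise scatterplot of distance, d_bank and famsize in hh.dta could be produced with (the choice of variables is purely illustrative):

graph matrix distance d_bank famsize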
For example, an opinion poll in a developing country based on the use of telephones will not reflect the public pulse, since access to telephones is not universal. In going about statistical inference, the crucial point is to recognize that statistics, such as a sample mean, vary from sample to sample if the sampling process were to be repeated. We can obtain the standard error of an estimate, which provides a measure of the precision of the estimate and consequently allows us to construct confidence intervals and/or perform hypothesis tests about the parameter of interest. By virtue of the Central Limit Theorem, we are able to say that we are 95% confident that the population mean is within two standard errors of the sample mean. Thus, if our estimate of the average monthly income based on a probability sample of households is 18,576 pesos with a standard error of 250 pesos, then we are 95% confident that the true average income is between 18,076 pesos and 19,076 pesos. If we would like to determine whether it is plausible to assume some hypothesized value for the average, such as 20,000 pesos (and merely attribute the difference between the hypothesized value and the sample average to chance), then the confidence interval, which does not contain the hypothesized value, suggests that the data are not consistent with an average income of 20,000 pesos. Statistical hypothesis testing is about making decisions in the face of uncertainty in the context of choosing between two competing statements about a population parameter of interest. This involves stating what we expect, i.e., our null hypothesis, and determining whether what we observed from the sample is consistent with that hypothesis. The statement that we consider correct if the null hypothesis is judged false is called the alternative hypothesis. As information becomes available, the null and alternative hypotheses are assessed using the data. Agreement between the null hypothesis and the data strengthens our belief that the null hypothesis is true, but it does not actually prove it. The whole process of hypothesis testing is subject to errors, the chances of which we would like to be small. Even if our decisions are evidence-based, these decisions themselves may not be perfect, as there is inherent uncertainty regarding the evidence. Although we would like to have perfect decisions, we cannot; we have to nonetheless make decisions in the face of uncertainty. Errors in
decisions arise from claiming one statement to be true when, in fact, it isn't. Typically, we are guided to reject the null hypothesis if the chance of obtaining what we observed or something more extreme (the so-called p-value) is small. To illustrate how to perform a hypothesis test, specifically a regular t-test with STATA, consider again hh.dta and enter:
cd c:\intropov
use hh
sum famsize
with the last command generating the (unweighted) average family size of 5.23. You may hypothesize that the average family size is 5 and assess whether the difference between the observed value (5.23) and what we expect under the null hypothesis (5) may be explained by chance. Intuitively, the smaller the observed difference (here, 0.23) is, the more likely it is that we can account for this difference as being merely due to chance. So, we want to find the chance of getting an observed difference of 0.23 or larger. If this chance is small, then we have evidence to suggest that the mean is not 5, i.e., we will reject the null hypothesis that the mean is 5. To run a t-test of the hypothesis that the average family size for the Bangladesh household data file hh.dta is 5, enter the following and obtain:
ttest famsize=5
One-sample t test

Variable | Obs | Mean    | Std. Err. | Std. Dev. | [95% Conf. Interval]
famsize  | 519 | 5.22736 | .092115   | 2.098525  | 5.046395   5.408325

mean = mean(famsize)                              t = 2.4682
Ho: mean = 5                      degrees of freedom = 518

Ha: mean < 5               Ha: mean != 5               Ha: mean > 5
Pr(T < t) = 0.9930         Pr(|T| > |t|) = 0.0139      Pr(T > t) = 0.0070
Notice that the right-sided p-value (0.0070) is small, much smaller than the traditional 5% level of significance. This suggests that the difference between 5.23 (what we observe) and 5 (what we expect) is too large to attribute to chance. Thus, we conclude that the average family size for the Bangladesh data is larger than 5. STATA actually has a number of commands for performing various statistical hypothesis tests, such as the one-sample t-test or its nonparametric counterparts:
signtest famsize=5
One-sided tests:
  Ho: median of members - 5 = 0 vs. Ha: median of members - 5 > 0
      Pr(#positive >= 202) = Binomial(n = 419, x >= 202, p = 0.5) = 0.7828
  Ho: median of members - 5 = 0 vs. Ha: median of members - 5 < 0
      Pr(#negative >= 217) = Binomial(n = 419, x >= 217, p = 0.5) = 0.2470
Two-sided test:
  Ho: median of members - 5 = 0 vs. Ha: median of members - 5 != 0
      Pr(#positive >= 217 or #negative >= 217) = min(1, 2*Binomial(n = 419, x >= 217, p = 0.5)) = 0.4941
The results suggest that there is no reason to believe that the median family size is not five, since the p-value for the two-sided test is rather large, viz., 0.4941 (compared to the typical 0.05 level of significance). Another nonparametric alternative to the one-sample t-test is the Wilcoxon signed-rank test (which also tests the median). Here, we use the signrank command:
signrank famsize=5
Wilcoxon signed-rank test

    sign |  obs | sum ranks | expected
positive |  202 |   68225.5 |    64945
negative |  217 |   61664.5 |    64945
    zero |  100 |      5050 |     5050

Ho: members = 5
           z = 0.972
Prob > |z| = 0.3312
which still suggests that there is no reason to believe that the median family size is not five, since the p-value for the two-sided test is rather large, viz., 0.3312 (compared to, say, a 0.05 level of significance). Beyond the one-sample t-test, the ttest command in STATA can also be used for two-sample and paired t-tests on the equality of means. For instance, by entering the following STATA command
table sexhead, c(mean distance)
you may discover that the average distance to paved roads seems to differ between male- and female-headed households, so that you may wish to enter:
ttest distance, by(sexhead)
in the command window to perform a two-sample test of the difference in mean distance to a paved road by sex of the household head. This tests the null hypothesis that the difference between the means is zero. The output shows t = 0.5204 with 517 degrees of freedom and, for Ha: diff > 0, Pr(T > t) = 0.3015; here, we do not have evidence to suggest a difference between male- and female-headed households. If you are convinced that there is also an inherent difference in the variability of distances between the two subpopulations of households, you can make the Satterthwaite adjustment with:
ttest distance, by(sexhead) unequal
To run the nonparametric counterpart of the two-sample t-test in STATA, enter
ranksum distance, by(sexhead)
and we also see that, indeed, there is no statistically significant difference in distance to a paved road between male- and female-headed households. STATA has a number of commands for performing various statistical hypothesis tests, especially for comparisons of two or more groups. The classical one-way analysis of variance, which tests that the mean is the same across groups against the alternative that at least one of the groups has a mean different from the rest, can be implemented in STATA with:
oneway distance region [aw=weight], tabulate
You can also add the scheffe option to the above command to generate post-hoc comparisons, the result of which suggests that distance to a paved road in Dhaka is significantly different
from corresponding distances in other regions, but there is no evidence that distances differ among regions outside Dhaka. The nonparametric analysis of variance can also be implemented with the kwallis command:
kwallis distance, by(region)
which also suggests that the differences in distance to a paved road across regions are statistically significant. You can also obtain a boxplot of the distribution of distance to a paved road across regions with
graph box distance, by(region)
but due to outliers and the scaling of the values of distance, we may not see the difference very clearly in the resulting plot. A number of other comparison tests can be performed in STATA, including the binomial test (through the bitest command), the Wilcoxon-Mann-Whitney test (with the ranksum command) and even a chi-square goodness-of-fit test. The latter can be implemented through the csgof command, which can be downloaded by entering:
net from http://www.ats.ucla.edu/stat/stata/ado/analysis
net install csgof
With the anova command, we can also implement an N-way analysis of variance (including a factorial ANOVA with or without interactions), an analysis of covariance and even a one-way repeated measures analysis of variance. In the latter case, we have one categorical independent variable and a normally distributed interval dependent variable that was measured at least twice for each subject. Perhaps you have noticed in some cross-tabulation of two variables X and Y that as one variable increased or decreased, the other variable in the cross-tab decreased or increased. While the naked eye can be good at noticing these relationships, it is important to test statistically whether the variables in the cross-tab are independent or not. By independent, we mean that as X moves one way or another, the movements of Y are completely random with respect to X. For instance,
tab toilet hhelec, col row chi2

implements a chi-square test of independence for the two variables toilet and hhelec, which indicate, respectively, whether (1) or not (0) the household has access to sanitary toilet facilities, and whether (1) or not (0) the household has access to electricity. The small p-value (Pr=0.000)
tells us that there is evidence to suggest that the two variables are related. The hypothesis tests discussed thus far do not incorporate the proper analytic/probability weights from the survey data into the results. When handling survey data with complex survey designs, you need to incorporate the survey design into the analysis. STATA provides a scheme for incorporating the survey design variables by way of the family of commands called the svy commands. These commands include:
svyset - declare the survey design variables
svydes - describe the strata and PSUs
svy: mean - estimate population and subpopulation means
svy: total - estimate population and subpopulation totals
svy: prop - estimate population and subpopulation proportions
svy: ratio - estimate population and subpopulation ratios
svy: tab - two-way tables
svy: reg - regression
svy: ivreg - instrumental variables regression
svy: logit - logit regression
svy: probit - probit regression
svy: mlog - multinomial logistic regression
svy: olog - ordered logistic regression
svy: oprob - ordered probit regression
svy: pois - Poisson regression
svy: intrg - censored and interval regression
Notice that all these commands begin with svy. To be able to use these commands, you first have to identify the weight, strata and PSU identifier variables. For hh.dta, enter
svyset vill [pweight=weight] || hhcode

and if you wish to obtain estimates of average distance to a paved road, average distance to a bank and average family size, then you enter:

svy: mean distance d_bank famsize

which provides the results displayed in Figure 4-14. Here, we see not only the point estimates for the variables but also the standard error of each mean and the 95% confidence interval for each mean. A confidence interval is essentially a range of values within which we are confident that the true value of the parameter lies. In particular, while our best estimate of the average family size is 5.19, we are 95% confident that it lies between 4.86 and 5.52.
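Once the design has been declared, the other svy commands follow the same pattern. For example, weighted proportions and a design-adjusted two-way table could be obtained with something like the following sketch (using variables already in hh.dta):

svy: proportion sexhead
svy: tabulate toilet hhelec, row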
We now turn to the regression model. Regression models have two components: a deterministic law-like behavior, called a signal in engineering parlance, and statistical variation or noise. The simple linear regression model relating Y and X is:

y_i = β0 + β1 x_i + ε_i

which shows that the signal β0 + β1 X is being contaminated by the noise variable ε. The magnitude of the output variable Y depends on the magnitude of the input variable X. A household's assets, for instance, functionally depend on the years of schooling of the household head. This does not, however, suggest that the educational attainment of the household head is the only factor responsible for the level of assets of the household, but that it is one possible determinant. All the other variables that may possibly influence the output variable (but which we do not account for) are thought of as lumped into the noise term. To make the model tractable, we assume that the noise is a random variable with zero mean. For each value of the input variable X, say x_i, and correspondingly each value of Y, say y_i, we may then assume that:

y_i = β0 + β1 x_i + ε_i,  i = 1, 2, …, n
where the noise terms ε1, ε2, …, εn form a random sample from a normal distribution with zero mean and constant variance. In consequence, the points (X, Y) will be more-or-less evenly scattered around a line. The parameters β0 and β1 in the regression model, respectively referred to as the intercept and slope, have to be estimated; the classical estimates, also called the least squares estimates, of these parameters can be readily obtained in STATA through the regress command. The estimated regression line ought to be viewed as a sample regression line since we are only working with sample data. This line is the best-fitting line for predicting Y for any value of X, in the sense of minimizing the distance between the data and the fitted line. By distance here, we mean the sum of the squares of the vertical distances of the points to the line. Thus, the resulting coefficients, slope and intercept, in the sample regression line are also called the least squares estimates (of the corresponding parameters of the population regression line).
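To make the criterion explicit: the least squares estimates b0 and b1 are the values that minimize Σ_i (y_i - b0 - b1 x_i)², the sum of squared vertical deviations of the observed points from the fitted line.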
Consider once again the hh.dta data set, and suppose that you want to regress household assets against the years of schooling of the household head. You would then enter the following command:

regress hassetg educhead

Notice that in the regress command, the first variable identified is the y variable, followed by the explanatory variable. Also, we can use weights in the regress command (whether analytic, importance, frequency or probability weights). The result of the command is shown in Figure 4-15, which lists an analysis of variance (ANOVA) table, information on model fit statistics, and a table of the estimated regression coefficients.
Figure 4-15. Output of the regress command.

The ANOVA table decomposes total variability into the variability explained by the regression model and everything else we cannot explain (the residual variability). The ratio of the mean squared variation due to the model to the residual mean square forms the F-statistic (here 36.14); this statistic and its associated p-value can be used for testing overall model adequacy. The overall model fit test involves the null hypothesis that X does not help in explaining Y against the alternative that the regression model is adequate; here, the small p-value (which is practically zero) suggests that we ought to reject the null hypothesis. That is, there is strong evidence to suggest that the overall fit is appropriate. Note, however, that in most practical instances we get such results; this is merely the first of many tests that need to be done to assess the adequacy of the model.
With the menus, one can select Statistics > Linear Models and Related > Linear Regression
and merely select the dependent and explanatory variables in the resulting dialogue box (see Figure 4-16) to run a regression.
Figure 4-16. Dialogue box for performing a regression. The estimated regression model is
Estimated hassetg = 113716 + 29974.14 × educhead

which suggests that for each additional year of schooling of the household head, we expect household assets to increase by about 29,974 taka. The utility of the estimated regression line lies not merely in explaining the relationship between the two variables X and Y (in the illustration above, years of schooling of the head and household assets, respectively) but also in making predictions on the variable Y given the value of X. Suppose that you wish to pick one of the households at random and guess its assets. In the absence of any information, the best guess would naturally be the average of household assets in the database, viz., 188,391.5 taka. If, in addition, you are provided information about the educational attainment (i.e., years of schooling) of the household head, say 3, then according to our estimated regression line, we expect the assets of this household to be

Estimated hassetg = 113716 + 29974.14*3 = 203,638.42

which is more than the overall average of household assets. This calculation may actually be obtained in STATA with the display command:

display 113716+29974.14*3
If you are convinced that the estimated regression model is adequate and you wish to obtain predicted y-values for given x-values, then after you have issued a regress command, you can issue a predict command. In the case of the regression of household assets (hassetg) on years of schooling of the head (educhead), you can enter:
predict assethat
label variable assethat "predicted mean asset score"

and these commands respectively generate a new variable from the earlier regress command that regresses hassetg on educhead, and label the resulting variable assethat as the predicted (average) hassetg for a given value of educhead. To obtain a scatterplot of the variables hassetg and educhead along with the estimated regression line, you need merely enter:

graph twoway line assethat educhead || scatter hassetg educhead

and thus obtain the graph shown in Figure 4-17.
Figure 4-17. Scatterplot and fitted regression line.

In theory, the intercept in the regression line represents the value of Y when X is zero, but in practice, we may not always have this interpretation. If the explanatory variable were family size, say, then zero members in the household would mean no household! The intercept here merely represents the value of Y for the estimated regression line if the line were extended to the point where X is zero.
Note that the hypothesis test for the overall model fit is equivalent to the test of the null hypothesis that the slope is zero, which is, in turn, also equivalent to the test of the null hypothesis that the population correlation coefficient is zero. In the simple linear regression model, the F-statistic is the square of the t-statistic for the slope, and the (upper-tailed) p-value associated with the F-statistic equals the two-sided p-value that STATA lists for the t-statistic of the educhead coefficient in Figure 4-15. Of course, both values here are practically zero and both suggest that the regression fit is a good one. You can also choose to suppress the constant in the regression model, i.e., perform regression through the origin; for this, you enter the nocons option:

reg hassetg educhead, nocons

to inform STATA that you wish the intercept to take the value zero. You can also choose to incorporate the survey design into the regression by way of the svy: reg command:

svy: reg hassetg educhead

Here, the resulting estimated coefficients are no different, but the standard errors of the estimated coefficients are adjusted upward from the standard errors calculated earlier under the classical regression model. Regression analysis involves predicting the value of a dependent (response) variable Y based on the value of one independent (explanatory) variable X. (Later, we will extend the single independent variable to a number of independent variables in the context of the multiple regression model.) The relationship between the variables is described by a linear function, so that changes in one variable are associated with changes in the other. Moving from correlation (and regression) to causation is often problematic, as there are several possible explanations for a correlation between X and Y (excluding the possibility of chance): it may be that X influences (or causes) Y; that Y influences X; or that both X and Y are influenced by some other variable. When performing correlation analysis of variables where there is no background knowledge or theory, inferring a causal link may not be justifiable
regardless of the magnitude of the correlation. While there may be a causal link between alcohol consumption and liver cirrhosis deaths, it may be difficult to make anything of the high correlation between pork consumption and cirrhosis mortality. Arm length and leg length are correlated but certainly not functionally dependent, in the sense that increasing arm length would not have an effect on leg length. In such instances, correlation can be calculated, but regressing leg length on arm length may not be of practical utility. In many cases, obtaining a regression fit gives a sensible way of estimating the y-value. If, however, there are nonlinearities in the relationship between the variables, one may have to transform the variables, say, by first generating the square roots or logarithms of the X and/or Y variables, and then fitting a regression model on the transformed variables. When using transformed variables, however, one will eventually have to re-express the resulting analyses in terms of the original units rather than the transformed data. In practice, we may use a power transformation, especially one suggested by the Box-Cox transformation, a family of transformations indexed by a parameter λ:
y(λ) = (y^λ - 1)/λ for λ ≠ 0, and y(λ) = ln(y) for λ = 0.

It is desired to choose an optimal value of λ. This can be readily worked out in STATA with the boxcox command:

boxcox hassetg educhead if hassetg>0

which suggests that the variable hassetg should be raised to the power +0.03 to improve the model fit, or even transformed to logarithms (i.e., the case when the parameter λ is estimated to be zero). You can also choose to have the explanatory variable subjected to a transformation. Typically, you may opt to use a log transformation as this is said to be variance stabilizing: big values are pulled in while small values change little, so the range of values gets bunched up. If the dependent variable Y is transformed to the log scale, the coefficient b1 of the independent variable can be approximately interpreted as follows: a 1-unit change in the variable X leads to an estimated 100·b1 percent change in the average of Y. If the independent variable X is logged, then the
regression coefficient b1 of X can be approximately interpreted as follows: a 1 percent change in X leads to an estimated b1/100 unit change in the average of Y. If both the dependent and independent variables are logged, the coefficient b1 of the independent variable X can be approximately interpreted as follows: a 1 percent change in X leads to an estimated b1 percent change in the average of Y. Therefore b1 is said to be the elasticity of Y with respect to a change in X. If both Y and X are measured in standardized form, i.e.:
Y* = (Y - Ȳ)/s_Y  and  X* = (X - X̄)/s_X,
then the coefficient b1 is called the standardized coefficient. It indicates the estimated number of standard deviations by which Y will change, on average, when X changes by one standard deviation (of X). There are a number of fundamental assumptions in the regression model. These
include (a) the value of the Y variable is composed of a linear function of X plus a noise variable; (b) the noise terms form a random sample with constant variance; (c) the noise distribution is a normal distribution. As a result of these assumptions, the Y values are themselves normally distributed for each value of X, as shown in Figure 4-18. These assumptions all pertain to the behavior of the noise values, which are unknown but can be estimated by the residuals, the differences between the observed Y-values and their predicted values. An analysis of the residuals will help us ascertain whether the assumptions of the regression model are tenable.
Figure 4-18. Distribution of Y around the regression line.

After regressing the variable hassetg on the variable educhead and generating a predicted value assethat, you could also obtain the residual values (with the predict command):

predict r, resid
label variable r "residual"
Residuals serve as estimates of the noise, and so the residuals contain information on whether the regression model assumptions are valid for the data being analyzed, i.e., whether the model fit is adequate. To validate the assumption of a normal distribution for the noise, you could look at the distribution of the residuals and see whether it is sensible to fit a normal curve through the distribution. You could either look at the kernel density estimate of the distribution with a fitted normal curve,

kdensity r, normal

or inspect the quantile-normal plot of the residuals:

qnorm r

Here, we ought to see the values of the observed quantiles of the data being more or less similar to the expected quantiles of the normal distribution, or equivalently, the points formed from the observed and expected quantiles falling along the 45-degree line passing through the origin. You could also further validate the results of the quantile-normal plot with a test of normality, such as the Shapiro-Wilk test, whose null hypothesis is that the data come from a normal distribution:

swilk r

The results of all these STATA commands on the residuals of the resulting regression model are shown in Figure 4-19.
Figure 4-19. Testing residual normality through (a) a quantile-normal plot; (b) a kernel density estimate with a fitted normal curve; (c) the Shapiro-Wilk test.
Plots of residuals against predicted values, residuals against explanatory variables, and residuals against time are also useful diagnostic plots for assessing regression model adequacy. The command

rvfplot

obtains a plot of the residuals versus fitted values, while

rvpplot educhead

yields a plot of the residuals versus the values of the educhead variable. You can also access these through the menus: Statistics > Linear Models and Related > Regression Diagnostics. A good guide for assessing residual plots is shown in Figure 4-20, which suggests that these plots ought not to display any patterns for the regression model to be considered appropriate.
Figure 4-20. Guide for analyzing residual plots to detect (a) linearity from nonlinearity; (b) homoscedasticity from heteroscedasticity.
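As a complement to these visual checks, a formal test of constant error variance is also available after regress; the following sketch runs the Breusch-Pagan/Cook-Weisberg test on the most recently fitted regression:

estat hettest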
If the residuals from a regression involving time series data are not independent, they are said to be autocorrelated. The best-known test for autocorrelation is the Durbin-Watson test, which makes sense only if the data have some inherent order, as in a time series. Let us
pretend that the observation number indicates the time at which the data were collected. Stata has a system variable called _n, which we will need to copy into a new variable, say obsnum, with the generate command. We will also need to let Stata know that obsnum is the time variable, with the tsset command. The postestimation command estat dwatson then performs a Durbin-Watson test.
gen obsnum=_n
tsset obsnum
reg hassetg educhead
estat dwatson

Some words of caution ought to be mentioned when performing a regression analysis. Do not attempt to automate regression. Lacking an awareness of the assumptions underlying least-squares regression will lead to ready acceptance of any regression model (even those which do not fit the data well). This lack of awareness may involve not knowing how to evaluate the assumptions, and not knowing that there are alternatives to least-squares regression if a particular assumption is violated. Also, it may be dangerous to apply a regression model without knowledge of the subject matter.
There are some suggestions on how to properly perform a regression analysis. Firstly, start with a scatter plot of Y against X to observe whether there is some suggestion of a possible relationship between the variables. After running the regression, perform a residual analysis to check the assumptions. Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to uncover possible non-normality. If there is a violation of any assumption, use alternative methods, e.g., robust regression, or transform the X-variable, the Y-variable, or both. You can use weights (analytic or probability) to combat heteroscedasticity. If there is no evidence of assumption violation, then you can test for the significance of the regression coefficient, and consequently construct confidence intervals and prediction intervals. In the simple linear regression model, we only related one variable Y to a single explanatory variable X. We may want instead to relate the response variable Y to a number of explanatory variables X1, X2, …, Xp. The multiple regression model
Y = β0 + β1 X1 + β2 X2 + … + βp Xp + ε

where ε is the noise, serves this purpose. Least squares estimates b0, b1, …, bp of the parameters β0, β1, …, βp respectively lead us to an estimated fit:

Ŷ = b0 + b1 X1 + b2 X2 + … + bp Xp

and just as in the simple linear regression model, we can also represent the value of Y as the sum of the fitted value and a residual term e:

Y = Ŷ + e
and also perform a residual analysis to assess the adequacy of the multiple regression model. The least squares estimates are quite difficult to calculate by hand, but they have a closed-form solution and can be obtained very readily with software such as STATA (cf. Figure 4-21).
Figure 4-21. Complications in calculating least squares estimates. With STATA, we merely list the explanatory variables after the response variable. For
instance, in the hh data set, if you wish to regress the asset variable hassetg on years of schooling of the head, family size and the binary indicator variable representing whether (1) or not (0) the household head is male, you enter the following command and obtain the result:
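The command itself is not reproduced in the text; it would have been along these lines (a sketch, with the regressor list inferred from the discussion that follows):

regress hassetg educhead famsize sexhead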
The output gives our estimated regression model for predicting assets from the years of schooling of the head, family size and the sex of the head.
Thus, assets are expected to increase by an estimated 30,000 taka for each additional year of schooling of the head, everything else held constant. Also, assets are an estimated 26,000 taka higher for each additional family member, ceteris paribus. The regression model can be used to predict the value of assets given the family size, years of schooling of the head and whether or not the head is male. The proportion of variation in the assets variable explained by the estimated regression model is 8.26%. Adjusted for the number of explanatory variables used in the model, the proportion of variation in the assets variable explained by the regression model is 7.73%. The adjusted coefficient of multiple determination (adjusted R-squared) reflects both the number of explanatory variables and the sample size in its calculation and is smaller than the (unadjusted) multiple R-squared. It penalizes the excessive use of independent variables. Since the unadjusted figure automatically increases as more independent variables are added, irrespective of whether these contribute significant additional information in explaining the dependent variable, the adjusted R-squared is the more useful criterion for comparing the fit of various models. The overall test of significance of the regression model, involving the null hypothesis that:

H0: β1 = β2 = β3 = 0
suggests that the model is adequate, since the p-value is rather small (in comparison with either a 0.05 or even a 0.01 level of significance). That is, there is evidence that at least one explanatory variable affects the response variable. The associated statistic for this test is an F-statistic with numerator degrees of freedom equal to the number of parameters minus one (here 4 - 1 = 3), and denominator degrees of freedom equal to the number of observations minus the number of parameters (here 519 - 4 = 515). In practice, we will often obtain a rejection of this null hypothesis, since the null hypothesis (that all the regression coefficients associated with the explanatory variables are zero) is such a strong statement (especially when we are considering a number of explanatory variables). Note that the individual t-tests on the significance of the variables used in the regression suggest that educhead and famsize are both conditionally important: if both are in the regression model, we cannot remove either one (given the presence of the other). In particular, testing the null hypothesis β3 = 0 associated with the sexhead variable against the alternative β3 ≠ 0 with the t-statistic, here taking the value -1.01, suggests that we cannot reject the null hypothesis since the associated p-value (31.5%) is rather large (compared to 5%). This means that there is no evidence of a linear relationship between assets and sex of the head, all other things being equal. For such individual tests of significance, the t-statistic is the ratio of the estimate of the regression coefficient to its standard error. The larger the magnitude of the t-statistic, the more convincing the evidence that the true value of the parameter is not zero (as this means the numerator, i.e., the estimate, is quite far, in relative terms, from zero). Note that aside from the overall F-test and the individual t-tests, you could also perform other hypothesis tests. You could, for instance, perform F-tests for user-specified hypotheses. Suppose that from the region variable representing the major region where the household resides (1 for Dhaka, 2 for Chittagong, 3 for Khulna and 4 for Rajshahi), we create four binary indicator variables, and we decide to add the regn1 variable (representing whether or not the household resides in Dhaka) to the earlier regression model and jointly test the importance of the variables famsize and regn1 in the model:

tab region, gen(regn)
The first command above yields the four binary indicator variables from the categorical variable region. The second command performs the regression without showing the output, since this regression model is merely an intermediate step toward the final result, i.e., testing the significance of the famsize and regn1 variables.
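The second command and the subsequent joint test are not shown in the text; they would have been along these lines (a sketch, assuming the joint test uses Stata's test command on the model with regn1 added):

quietly reg hassetg educhead famsize sexhead regn1
test famsize regn1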
Stata's estimates command makes it easier to store and present different sets of estimation results:

quietly reg hassetg educhead
estimates store model1
quietly reg hassetg educhead agehead famsize sexhead
estimates store model2

The stored results can then be tabulated side by side, say, comparing regression coefficients:
estimates table model1 model2, stat(r2_a) b(%5.3f) star title("Models of household assets")

or presenting the precision of the estimated regression coefficients:
estimates table model1 model2, stat(r2_a) b(%5.3f) se(%5.3f) p(%4.3f) title("Models of household assets")
It is assumed that in building the regression model we properly chose explanatory variables backed by an appropriate theory or by practical reasons. Transforming either or both the independent and dependent variables may be needed to facilitate analysis. The choice of transformation can be based on theory, logic or scatter diagrams. Occasionally, some transformation may induce linearity. Consider the relationship
ln Y_i = β0 + β1 X_1i + β2 X_2i + ε_i
When we do not have an idea of what transformation may be needed, it may be helpful to use the Box-Cox transformation as illustrated earlier. Log transformations are useful, as small changes in the log of a variable correspond roughly to percentage changes in the variable (a change of 0.01 in the log is roughly a 1 percent change).
Sometimes, our explanatory variables may have very large correlations with one another. This is known as multicollinearity. It means that little or no new information is provided by some variables, and this leads to unstable coefficients as some variables become proxy indicators of other variables. When we add a new variable that is strongly related to the explanatory variables already in the model, multicollinearity is suggested when (a) we obtain substantially higher standard errors; (b) we see unexpected changes in the magnitudes or signs of regression coefficients; or (c) we find nonsignificant individual t-tests despite a significant overall model F-test. Following http://www.ats.ucla.edu/stat/stata/webbooks/reg/chapter2/statareg2.htm, we can measure multicollinearity by calculating the variance inflation factors (VIFs):
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear
quietly regress api00 acs_k3 avg_ed grad_sch col_grad some_col
vif
(The vif command lists each variable together with its VIF and 1/VIF values.)
All of these variables measure the education of the parents and the very high VIF values indicate that these variables are possibly redundant. For example, after you know grad_sch and
col_grad, you probably can predict avg_ed very well. In practice, we ought to be concerned when some variables have VIF values greater than 5. Now, let us try to re-do the analysis but with some variables removed:
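The re-run would look something like this (a sketch following the UCLA example, dropping the composite avg_ed variable):

regress api00 acs_k3 grad_sch col_grad some_col
vif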
Notice that the VIF values are now better and the standard errors have been reduced. There are other ways of addressing multicollinearity apart from dropping variables from the regression. The researcher may even have the option of not addressing it at all if the regression model is to be used for prediction purposes only. The interested reader may refer to Draper and Smith (1998), for example.
As the number of explanatory variables grows, the number of regression models to compare also grows considerably. For 10 explanatory variables, we have to compare 1,023 (= 2^10 - 1) models. Instead of choosing the best model among all possible regressions, we may adopt computational algorithms for regression model variable selection:

Forward Inclusion
Backward Elimination
Forward Stepwise
Backward Stepwise
These can be readily implemented in Stata. For instance, forward stepwise is implemented with:

use c:\intropov\data\hh, clear
sw, pr(0.10) pe(0.05) forward: reg hassetg educhead agehead famsize sexhead
Note that both pr and pe are specified as options for the probability to remove and the probability to enter, respectively; the forward option is also used, otherwise backward stepwise is implemented. The above choice of pr and pe makes it more difficult for a regressor to enter the model than to be removed from it. Forward stepwise is a modification of forward inclusion: it starts with no regressors in the model and proceeds as in forward inclusion, entering first the
explanatory variable with the largest simple correlation with the dependent variable. It is entered if the F-statistic for the estimated regression model including it exceeds the F value corresponding to pe=0.05. It then enters a second regressor, the one with the largest correlation with the dependent variable after adjusting for the effect of the first regressor entered (such a correlation is called a partial correlation). The computational procedure then reassesses the included regressors based on their partial F-statistics for exclusion; it drops the least significant variable from the model if its partial F-statistic is less than the preselected F corresponding to the pr value. Then it proceeds as in forward inclusion, enters the next regressor, and iterates on the entry and reassessment steps.
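For comparison, backward elimination starts from the full model and needs only a removal threshold; a sketch with the same regressors would be:

sw, pr(0.10): reg hassetg educhead agehead famsize sexhead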
The computational algorithms (except for all possible regressions) do not necessarily produce the best regression model, and the various algorithms need not yield exactly the same result. Other software, e.g., R and SAS, use other criteria, e.g., adjusted R², Mallows' Cp, Akaike's Information Criterion (AIC), or the Schwarz Bayesian Information Criterion (BIC), rather than probabilities (or F values) to remove or enter variables in the model.
Note that Stata allows a hierarchical option for hierarchical selection of variables, and a lock option that forces the first (set of) regressor variable(s) listed to be included in the model. Aside from multicollinearity, the need to use transformations of the variables, and the use of stepwise techniques, there are other issues in the use of regression models, e.g., the need for outlier detection and influence diagnostics, and testing for misspecification. We provide in the next sub-sections a discussion of logistic regression and of common multivariate tools that may be of help in official statistics research.

4.3.2. Logistic Regression and Related Tools with STATA

Logistic regression, also called logit analysis, is essentially regression with a binary {0,1} dependent variable. In a nutshell, it is regression with an outcome (dependent) variable that is a categorical dichotomy and with explanatory variables that can be either continuous or categorical. In other words, the interest is in predicting which of two possible
events is going to happen given certain other information. STATA provides two commands, logistic and logit, for implementing logistic regression: logit reports coefficients, while logistic reports exponentiated coefficients. To illustrate the use of these commands, consider again the hh.dta database, and suppose that we define households as poor or non-poor through the variable povind, depending on whether their household assets per capita are less than 15,000. Now let us generate a per capita asset variable, which we will call assetpc, and the povind variable; let us also run a logistic regression with Stata's logistic command:

use c:\intropov\data\hh, clear
gen assetpc=hassetg/famsize
gen povind=1 if assetpc<15000
replace povind=0 if assetpc>=15000
logistic povind educhead
The results indicate that years of schooling of the head (educhead) is a statistically significant predictor of povind (i.e., of a household being poor), with the Wald statistic taking the value z = 5.47 and a very small p-value (practically 0). Likewise, the test of the overall model is statistically
significant, with a likelihood ratio chi-squared statistic of 32.47 and a corresponding negligible p-value of 0.00. If we now run the logit command:

logit povind educhead
which shows that the estimation procedure for logistic regression involves maximizing the (natural) log of the likelihood function. Maximizing this nonlinear function involves numerical methods. At iteration 0 of the algorithm, the log likelihood describes the fit of a model including only the constant. The final log likelihood value describes the fit of the final model

L = 0.5760263 - 0.1489314 × educhead

where L represents the predicted logit (or log odds of a household being poor), that is:

L = ln [ Pr(poor) / Pr(nonpoor) ]

An overall likelihood ratio χ² test (with degrees of freedom equal to the number of parameters minus one, here 2 - 1 = 1) evaluates the null hypothesis that all coefficients in the model, except the constant, equal zero:

χ² = -2(L_i - L_f) = -2[-357.0325 - (-340.79956)] = 32.47

where L_i is the initial log likelihood and L_f is the final iteration's log likelihood. A convenient but less accurate test is given by the asymptotic z statistic associated with a logistic regression coefficient. Note that the logit z and the likelihood ratio χ² tests sometimes disagree, although here they agree. The χ² test has more general validity. In the STATA output, we also find a pseudo-R² statistic,

pseudo R² = 1 - (L_f / L_i) = 1 - (-340.79956)/(-357.0325) = 0.04546628
which provides a quick way to describe or compare the fit of different models for the same dependent variable. Note that unlike the R2 in classical regression, the pseudo R2 statistics lack the straightforward explained-variance interpretation.
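After logit (or logistic), the same figure can be reproduced from the log likelihoods stored by the estimation command (a sketch):

display 1 - e(ll)/e(ll_0)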
A number of add-on programs in Stata, such as the spost9_ado package, have been developed for ease in running and interpreting logistic regression models. To download the complete package, merely enter the following in the Stata command window:

net from http://www.indiana.edu/~jslsoc/stata/
net install spost9_ado
Consider now the utilities developed by J. Scott Long & Jeremy Freese: listcoef
We could also specify that percent change coefficients be listed: listcoef, percent
or request the listing of some model fit statistics with the fitstat command:
fitstat
Another model fit statistic is the Hosmer & Lemeshow goodness-of-fit indicator given by the lfit command: lfit
After logit (as after regress), we can issue the predict command to obtain predicted probabilities:

predict phat
label variable phat "predicted p(poor)"
graph twoway connected phat educhead, sort
(Figure: predicted p(poor) plotted against education (years) of the household head.)
The goal in logistic regression is to estimate the probability that an observation comes from sub-population 1 (here, the poor households):

p(y = 1 | x) = exp(β0 + β1'x) / [1 + exp(β0 + β1'x)]
The logit, or log odds, is defined as the log of the odds, i.e., the ratio of the probability of falling in sub-population 1 to the probability of falling in the other sub-population:
g(x) = log [ p(y = 1 | x) / (1 - p(y = 1 | x)) ]

which can be shown to equal:

g(x) = β0 + β1'x
Thus, as the explanatory variable increases by one unit, the log odds changes by β1 units. Equivalently, exp(β1) gives the multiplicative change in the odds resulting from a unit change in the explanatory variable. In particular, the coefficient on educhead (given by the logit command) describes the effect of the number of years of schooling of the head on the logit (or log odds of a household being poor). Each additional year of schooling of the head decreases the predicted log odds of being poor by 0.1489314. Equivalently, each additional year of schooling of the head multiplies the predicted odds of being poor by exp(-0.1489314) = 0.86162822; an increase of two years of schooling of the head multiplies the predicted odds by
exp(-0.1489314*2) = 0.74240319; and an increase of N more years of schooling multiplies the predicted odds by exp(-0.1489314*N). Interestingly, STATA allows us to obtain the predicted probability of being poor at each value of the variable educhead with the prtab command:

prtab educhead
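These odds multipliers can also be computed directly from the stored coefficient rather than retyping it (a sketch, run right after the logit fit):

display exp(_b[educhead]*2)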
After fitting a logistic regression model, we can obtain a classification table with:

estat classification

The output, shown in Figure 4-23, suggests that the use of the logistic regression model results in an overall correct classification rate of 62%. Of the 286 poor households, 224 were correctly classified as poor (a sensitivity of 78%), while of the 233 nonpoor households, only 96 were correctly classified (a specificity of 41%), so that the overall correct classification rate is 62% = (224 + 96)/(286 + 233).
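The table above uses the default probability cut-off of 0.5; other cut-offs can be examined with the cutoff() option (the value below is purely illustrative):

estat classification, cutoff(0.25)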
In practice, we use the estimated probabilities to discriminate, i.e., to predict whether an observation is in sub-population 1, using some probability cut-off, say 0.50. The plot of sensitivity and specificity versus values of the probability cut-off can be obtained with:

lsens

thus yielding Figure 4-24.
Figure 4-24. Sensitivity/Specificity versus Probability Cut-off. Note that when using survey data, we should incorporate the probability weights [pw=weight] within the logit or logistic command, or use the svy:logit command. Also, you could add some more variables to improve the model fit, and/or perform model diagnostics (See Hosmer and Lemeshow, 2000 for details). Various prediction and diagnostic statistics can be obtained with the predict command options:
xb - linear prediction (predicted log odds that y = 1)
stdp - standard error of the linear prediction
dbeta - influence statistic analogous to Cook's D
deviance - deviance residual
dx2 - change in the Pearson chi-square
ddeviance - change in the deviance chi-square
hat - leverage
number - assigns a sequence number to each covariate (x) pattern
residuals - Pearson residual
rstandard - standardized Pearson residual
To perform some logistic regression model diagnostics with a model that has more
explanatory variables (and incorporates the use of the xi command for automatically generating
indicator variables from some categorical explanatory variable, here region), we may want to try the following:

xi: logit povind educhead agehead famsize i.region
predict pprob, p
predict r, residuals
predict h, hat
predict db, dbeta
predict dx2, dx2
scatter h r, xline(0) msym(Oh) jitter(2)

This yields Figure 4-25.
Figure 4-25. Leverage plotted against Pearson residuals.
while

twoway (scatter dx2 pprob if povind, msym(Oh) jitter(2)) (scatter dx2 pprob if ~povind, msym(Oh) jitter(2))

generates Figure 4-26.
Figure 4-26. Change in Pearson chi-square (H-L dX²) plotted against the predicted probability.
There are other related STATA commands, including:

ologit - ordered logistic regression, where the y variable is ordinal (e.g., 1 = ultra poor, 2 = poor but not ultra poor, 3 = nearly poor, 4 = nonpoor who are not nearly poor)
mlogit - multinomial logit regression, where y has multiple but unordered categories
svy: logit - logit regression incorporating the survey design variables
clogit - conditional logistic regression
STATA's mlogit command can be used when the dependent variable takes on more than two outcomes and the outcomes have no natural ordering. The svy: logit command is used for survey data, especially when the design is a complex survey design. The clogit command performs maximum likelihood estimation with a dichotomous dependent variable; conditional logistic analysis differs from regular logistic regression in that the data are stratified and the likelihoods are computed relative to each stratum. The form of the likelihood function is similar, but not identical, to that of multinomial logistic regression. In econometrics, conditional logistic regression is called McFadden's choice model. STATA's ologit command performs maximum likelihood estimation of models with an ordinal dependent variable, a variable that is categorical and whose categories can be ordered from low to high, such as "poor", "fair", and "good". Another alternative to logit regression is probit regression, implemented with the command probit. Suppose we observe covariates X1, X2, …, Xp of some latent variable Z for which:
Z = b0 + b1 X1 + b2 X2 + … + bp Xp

with Z observed only to the extent that each observation is known to be in one category (high Z values) or the other (low Z values). If we assume Z has a normal distribution, then the model is called a probit model; if Z has a logistic distribution, then the model is a logit model.
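For instance, the earlier poverty-indicator model could be refit as a probit, with or without the survey design (a sketch; probit coefficients are on a different scale from logit coefficients):

probit povind educhead
svy: probit povind educhead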
STATA offers a family of probit-related commands, covering:

probit regression giving changes in probabilities instead of coefficients
probit regression with grouped data
probit regression with selection
heteroscedastic probit estimation
probit regression with an ordinal y variable
probit regression incorporating the survey design
probit regression with selection, incorporating the survey design
ordered probit regression incorporating the survey design
Another alternative to logit (and probit) regression is discriminant analysis, a statistical technique (first introduced by Sir Ronald Fisher) that identifies variables important for distinguishing among mutually exclusive groups and predicts group membership for new cases from a list of proposed (explanatory) variables. The idea behind discriminant analysis is to form variates, linear combinations of the numerical independent variables, which are used for classifying cases into the groups. Like logistic regression, discriminant analysis predicts group membership and, in practice, requires an estimation (training) sample and a validation (holdout) sample to properly assess the predictive accuracy of the model.
Discriminant analysis requires a categorical dependent variable and multiple numeric explanatory variables.² If the latter are assumed to follow a multivariate normal distribution, we can use the F test. A discriminant analysis (implicitly) assumes that all relationships are linear. In addition, linear discriminant analysis³ assumes unknown, but equal, dispersion or covariance matrices for the groups. This is necessary for the maximization of the ratio of the variance between groups to the variance within groups. Equality can be assessed by Box's M test for homogeneity of dispersion matrices. Remedies for violations include increasing the sample size and using nonlinear classification techniques. Multicollinearity among the explanatory variables may impede the inclusion of variables in a stepwise algorithm. Also, there ought not to be any outliers; the presence of outliers adversely impacts the classification accuracy of discriminant analysis. Diagnostics for influential observations and outliers should be done.
Selection of explanatory variables in discriminant analysis may be based on previous research, a theoretical model, or intuition. It is suggested that for every explanatory variable
² Independence among explanatory variables is not assumed in Fisher's DA. The variables in Fisher's classic example, the iris data set, are highly correlated.
³ Linear discriminant analysis assumes equal covariance matrices for the groups. Quadratic discriminant analysis allows for unequal covariance matrices.
used, there should be about 20 cases. Also, at minimum, the smallest group size must exceed the number of explanatory variables with each group having at least 20 observations.
To understand the idea behind discriminant analysis, consider Figure 4-27. Suppose that an observation (representing, say, an individual respondent, household, farm, or establishment) can be classified into one of two groups on the basis of a measurement of one characteristic, say Z. We then need to find some optimal cut-off value C which divides the entire dataset in such a way that high values of Z indicate that the observation comes from the first group, and low values indicate that it comes from the second group. Of course, there will be misclassifications according to this rule: an observation from the second group that actually has a high value of Z will be classified by the rule as coming from the first group, and cases from the first group with low values of Z will be classified as coming from the second group. In practice, though, we do not usually classify on the basis of just one variable; we may have a number of variables at our disposal. Suppose, though, that all the variables can be put together into one linear function Z. This is illustrated in Figure 4-28 for two variables in the two-group case.

Figure 4-27. Discriminant analysis for two groups (given one variable Z) seeks an optimal rule which divides the data into two and allows us to classify each observation as belonging to one group or the other.
Figure 4-28. Illustration of discriminant analysis for two groups with two variables.

The characteristic Z used for the classification process is called a linear discriminant function. Given explanatory variables X1, X2, …, Xp, the discriminant function is:

Z = b0 + b1 X1 + b2 X2 + … + bp Xp
where the b_j's are chosen so that the values of the discriminant function differ as much as possible between the groups. A discriminant analysis can be readily implemented in STATA with the discrim command. We illustrate its use once again on hh.dta:
discrim lda educhead, group(povind)
estat loadings, unstandardized

This yields the results in Figure 4-29.
(a) (b) Figure 4-29. Results of discriminant analysis: (a) classification matrix; (b) estimated discriminant function.
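The classification matrix in panel (a) can also be requested directly after the discrim fit (a sketch):

estat classtable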
Figure 4-30. Three factors formed from nine observed variables.

In practice, the process of obtaining the derived factors is not so clear-cut. There is usually some amount of overlap between the factors, since each of the observed variables defining a factor has some degree of correlation with the variables in the other factors. Unlike in regression methods, when performing a factor analysis it is important for multicollinearity to be present. In fact, the rule of thumb is that there should be a sufficient number of correlations greater than 0.30. The factor analysis model indicates that the variables have a common and a unique part:

Z_j = a_j1 F1 + a_j2 F2 + … + a_jm Fm + U_j,  j = 1, 2, …, k
where the variables are written in standardized form Z, the F's are the common factors (with the coefficients a_jm called factor loadings) and U represents a factor unique to each variable. Note that we would prefer the number m of factors to be much less than the number k of variables. While
the factor analysis model looks similar to a regression model, we have no dependent variable when using factor analysis. The variables used in a factor analysis ought to be metric (i.e., numerical). The variables need not be normally distributed, but if they are multivariate normal, we can apply some statistical tests. When designing a factor analysis, a researcher should include a sufficient number of variables to represent each proposed factor (i.e., five or more). Note that sample size is an important consideration. Typically, the number of observations to be analyzed in a factor analysis should be 100 or larger; between 50 and 100 observations may still be analyzed, but with caution. It is also suggested that the ratio of observations to variables be at least 5 to 1. To perform a factor analysis, you ought to follow these six steps:

STEP 1: Collect data and perform a correlation analysis;
STEP 2: Extract initial factors (using some extraction method);
STEP 3: Choose the number of factors to retain;
STEP 4: Rotate and interpret;
STEP 5: Decide if changes need to be made (e.g., drop item(s), include item(s)); if changes are made, repeat STEPS 2-4;
STEP 6: Construct scales and use them in further analysis.

After organizing the data into the array shown in Figure 4-31(a), you need to obtain the correlation matrix of the variables (as shown in Figure 4-31b), from which we can obtain
the diagonal entries of the anti-image correlation matrix, i.e., the measure of sampling adequacy (MSA) for each variable i:

MSA_i = Σ_{j≠i} r_ij² / ( Σ_{j≠i} r_ij² + Σ_{j≠i} a_ij² )

and the overall Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy:

KMO = Σ_{i≠j} r_ij² / ( Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij² )

where r_ij denotes the correlation between variables i and j, and a_ij the corresponding anti-image (partial) correlation.

Figure 4-31. (a) Data Matrix; (b) Correlation Matrix.
The KMO can be helpful in determining whether or not it will be worthwhile to perform a factor analysis on the variables. In particular, you can interpret this index as follows:

Marvelous: greater than 0.90
Meritorious: in the 0.80s
Middling: in the 0.70s
Mediocre: in the 0.60s
Miserable: in the 0.50s
Unacceptable: below 0.50
To illustrate how to perform a factor analysis with STATA, consider the nations dataset, and let us obtain the correlation matrix and the KMO index with the corr command and (after a factor run) the estat kmo command, respectively:
use http://www.ats.ucla.edu/stat/stata/examples/sws5/nations, clear
corr pop-school3
factor pop-school3
estat kmo, novar

These commands yield the results shown in Figure 4-32 and Figure 4-33. It can be observed that a number of variables are highly correlated; the KMO index indicates that the data are meritorious for performing a factor analysis.
Figure 4-32. The correlation matrix for variables in the nations dataset.
Figure 4-33. Results of the estat kmo command.

In STATA, we can apply a number of factor extraction methods as an option of the factor command, namely:

pf (principal factor)
ipf (iterated principal factor)
ml (maximum likelihood)
pc (principal components)
factor pop-school3, pf
We can readily represent the variables (in standard form) in terms of the eight factors and a unique component. Thus, for instance, according to the STATA output and the factor analysis model, the (standardized form of the) population variable is:
pop = 0.01722 F1 - 0.18702 F2 + 0.37855 F3 + 0.11405 F4 + 0.13907 F5 - 0.05462 F6 + 0.01500 F7 + 0.01468 F8 + 0.78566
Statisticians prefer the principal components and maximum likelihood extraction methods, while social scientists prefer principal factor methods, in which the factors are inferred. Principal components factor (PCF) analysis is a factor analysis extraction method and is not equivalent to principal components analysis (PCA). The singular value decomposition (SVD) used to extract principal components in PCA is also used to extract factors in principal components factor analysis; SVD is applied to the correlation matrix in both PCA and PCF. But the two procedures differ in the succeeding steps. In PCA, all the eigenvectors extracted by SVD are used in estimating the loadings of all p variables on all p principal components. In PCF, only the eigenvectors corresponding to m << p factors are used in estimating the loadings, and the procedure is such that the choice of m affects the estimates. Principal factor (PF) analysis extracts the eigenvectors from the reduced correlation matrix. The reduced correlation matrix is like the correlation matrix but, instead of 1s on the main diagonal, has estimates of the communalities on the main diagonal. Communalities refer to the variances that the X variables have in common. If the X variables have large communalities, a factor analysis model is appropriate. The estimate commonly used for the communality of the ith variable is the R² obtained by regressing it on the other X variables. Regardless of the method employed, it is important to obtain simple and interpretable factors. Factor loadings, which vary from -1.00 to +1.00, represent the degree to which each of the variables correlates with each of the factors. In fact, these factor loadings are the correlation coefficients of the original variables with the newly derived factors (which are themselves variables). Inspection reveals the extent to which each of the variables contributes to the meaning of each of the factors. You can use the following guide to interpret factor loadings (in absolute value):

0.40 - important
0.50 - practically significant
Note that: (a) an increase in the number of variables decreases the level of significance; (b) an increase in sample size decreases the level necessary to consider a loading significant; (c) an increase in the number of factors extracted increases the level necessary to consider a loading significant. To determine the number of factors to use, you can be guided by:

The a priori criterion (you can choose the number of factors prior to the analysis, perhaps guided by some theory or practical convenience)
The latent root/Kaiser criterion, which retains as many factors as there are eigenvalues greater than 1; eigenvalues represent the amount of variance in the data described by the factors
The percent of variance explained
A scree plot (elbow rule)
In the output of the earlier command, we see that the latent root criterion suggests the use of two factors. If we were satisfied with a ninety percent cut-off for the proportion of variation explained, then likewise we would select two factors. (If we were satisfied with 80% of the variation explained, then one factor would suffice.)
The scree plot (elbow rule) is a plot of the eigenvalues against the factor number. It is suggested that we find the point where the smooth decrease of the eigenvalues appears to level off; this gives the number of factors to be used. You can generate a scree plot in STATA by entering:

greigen, yline(1)

which yields the plot shown in Figure 4-34; this plot suggests the use of two or three factors.
Figure 4-34. Scree plot.

If the resulting factors are difficult to interpret and name, you may try rotating the factor solution. A rotational method distributes the variance from earlier factors to later factors by turning the axes about the origin until a new position is reached. The purpose of a rotated factor solution is to achieve a simple structure. In STATA, a rotated factor solution can be obtained (after generating the factor solution) by entering the command

rotate

The default rotation is a varimax rotation, which is an orthogonal rotation of the factors; other options include promax, factors, and horst. Once you are comfortable with the factors, you ought to generate the factor scores (linear composites of the variables). Several methods can be employed for obtaining the factor score coefficients; the default method in STATA, as in most software, is the regression factor score. The following commands generate the factor scores for two factors (formed by standardizing all variables, weighting by the factor score coefficients and summing for each factor), list the nations and their factor scores, and generate a scatterplot of the two factors:

factor death-school3, factor(2)
rotate
predict f1 f2
label var f1 "demographics"
label var f2 "development"
list country f1 f2
sum f1 f2
correlate f1 f2
graph twoway scatter f2 f1, yline(0) xline(0) mlabel(country) mlabsize(medsmall)
The resulting plot (shown in Figure 4-35) illustrates a clustering of countries according to their economic development status. Developing countries are on the lower portion of the scatterplot, with the very poor countries on the lower right portion.
Figure 4-35. Scatterplot of the two factor scores (factor 1: demographics; factor 2: development), with countries as labels.
A set of observations on a variable recorded sequentially over time is a time series. The daily market prices of a certain stock at the market closing over a 6-month period constitute a time series. Time series that are of interest primarily for economic analysis include the Gross Domestic Product, the Gross National Product, the Balance of Payments, unemployment levels and rates, the Consumer Price Index and exchange rate data. Time series can be either continuous or discrete. Frequently, continuous time series are discretized by measuring the series at small, regular time intervals.
Time may be viewed as just another variable that we observe together with some endogenous dependent variable, but time is an ordered variable. Typically, we denote the time periods by successive integers. We may also consider a continuous time index, but here we anchor ourselves at some point t=0 in time, and we have a sense of past, present and future. The future cannot influence the past, but the past can, and often does, influence the future. Time series analysis involves discovering patterns, identifying cycles (and their turning points), forecasting and control. Analysts are sometimes interested in long-term cycles, which are indicators of growth measures, and so it is helpful if shorter-length cycles are removed. Out-of-sample forecasts may be important to set policy targets (e.g., next year's inflation rate can lead to a change in monetary policy). Time series models involve identifying patterns in the time series, formally describing these patterns, and utilizing the behavior of past (and present) trends and fluctuations in the time series in order to extrapolate future values of the series. In model building, it is helpful to remember the words of time series guru George E.P. Box: "All models are wrong, but some are useful." Several models can be used to describe various features of a time series. A distinguishing characteristic of a time series is autocorrelation (since data are taken on the same object over time). Estimated autocorrelations can be exploited to obtain a first impression of possible useful models to describe and forecast the time series. To use the time series commands in STATA, it is important to let STATA know that a time index is found in the data set. Suppose we want to read the Excel file dole.xls (partially displayed in Figure 4-36).
Figure 4-36. Screen shot of dole.xls
Note that this data file (representing the monthly total of people on unemployment benefits in Australia from January 1956 to July 1992) is available from the website http://robjhyndman.com/forecasting/gotodata.htm. Make sure to save only the data, i.e., remove the second column information, and save the file in csv format, say into the c:\intropov\data folder. After converting the file into CSV format, you can then read the file, generate a time variable called n (based on the internal counter _n), employ the tsset command (to let STATA know that n is the time index), and get a time plot of the data with graph twoway line, as shown:
insheet using c:\intropov\data\dole.csv, clear
rename v1 ts
gen n=_n
tsset n
graph twoway line ts n
which yields the time plot in Figure 4-37.
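As a side note (an optional sketch that is not part of the original walkthrough), recent versions of STATA also let you declare a proper monthly date index instead of the plain counter n; assuming the series does start in January 1956, one possibility is:
* construct a monthly date index starting at January 1956
gen t = tm(1956m1) + _n - 1
format t %tm
tsset t
The rest of this section, however, keeps the simple counter n as the time index.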
Figure 4-37. Time plot of the ts series. As was earlier pointed out, a time series may be thought of as a set of data obtained at regular time intervals. Since a time series is a description of the past, forecasting makes use of these historical data. If the past provides patterns, then it can provide indications of what we can expect in the future. The key is to model the process that generated the time series, and this model can then be used for generating forecasts. Thus, while a complete knowledge of the exact form of the model that generated the series is hardly possible, we choose an approximate
model that will serve our purposes of explaining the process that generated the series and of forecasting the next value of the series. We may investigate the behavior of a time series either in the time domain or in the frequency domain. An analysis in the time domain involves modeling a time series as a process generated by a sequence of random errors. In the frequency-domain analysis of a time series, cyclical features are studied through a regression on explanatory variables involving sines and cosines that isolate the frequencies of the cyclic behavior. Here, we will consider only an analysis in the time domain. Thus, we assume that a time series consists of a systematic pattern contaminated by noise (random error) that makes the pattern (also called a signal) difficult to identify and/or extract. Noise comprises the erratic, nonsystematic, irregular random fluctuations that are short in duration and non-repeating. Most time series analysis tools to be explored here involve some form of filtering out of the noise in order to make the signal more salient and thus amenable to extraction. Decomposing a time series highlights important features of the data. This can help with the monitoring of a series over time, especially with respect to the making of policy decisions. The decomposition of a time series into components is not unique. The problem is to define and estimate components that fit a reasonable theoretical framework as well as being useful for interpretation and policy formulation. When choosing a decomposition model, the aim is to choose a model that yields the most stable seasonal component, especially at the end of the series. The systematic components in a time series typically comprise trends, seasonal patterns, and cycles. Trends are overall long-term tendencies of a time series to move upward (or downward) fairly steadily. The trend can be steep or not. It can be linear, exponential, or less smooth, displaying tendencies that once in a while change direction. Notice that there is no real definition of a trend, merely a general description of what it means, i.e., a long-term movement of the time series. The trend is a reflection of the underlying level of the series. In economic time series, this is typically due to influences such as population growth, price inflation and general economic development.
When time series are observed monthly or quarterly, the series often displays seasonal patterns, i.e., regular upward or downward swings observed within a year. This is also not quite a definition, but it conveys the idea that seasonality is observed when data in certain seasons display striking differences to those in other seasons, and similarities to those in the same seasons. Every February, for instance, we expect sales of roses and chocolates to go up (as a result of Valentine's Day). Toy sales rise in December. Quarterly unemployment rates regularly go up at the end of the first quarter due to the effect of graduation. In Figure 4-38, we see 216 observations of a monthly series (a monthly index, from January 1976 to December 1993, of French Total Industry Production excluding Construction) that comprise an upward trend and some seasonal fluctuations.
Figure 4-38. Monthly Index of French Total Production excluding Construction (Jan 76 to Dec 93). When a calendar year is considered the benchmark, the number of seasons is four for quarterly data and 12 for monthly data. Cycles in a time series, like seasonal components, are also upward or downward swings, but, unlike seasonal patterns, cycles are observed over a period of time beyond one year, such as 2-10 years, and their swings vary in length. For instance, cyclic oscillations in an economic time series may be due to changes in the overall economic environment not caused by seasonal effects, such as recession-and-expansion. In Figure 4-39, we see a monthly time series of 110 observations (representing Portugal Employment in Manufacture of Bricks, Tiles and Construction Products from January 1985 to February 1994). Notice the downward trend and the irregular, short-term cyclical fluctuations.
Figure 4-39. Portugal Employment in Manufacture of Bricks, Tiles and Construction Products (Jan 85 to Feb 94).
These components may be thought of as combining either additively or multiplicatively to yield the time series. Aside from trends, seasonal effects, cycles and random components, a time series may have extreme values/outliers, i.e., values which are unusually higher or lower than the preceding and succeeding values. Time series encountered in business, finance and economics may also have Easter/moving holiday effects and trading day variation. A description of these components can be helpful for accounting for the variability in the time series, and for predicting future values of the time series more precisely. In addition to describing these components, it can be helpful to inspect whether the fluctuations are of constant size (i.e., a nonvolatile, constant-variance, homoskedastic pattern) or not (i.e., volatile, heteroskedastic). It may also be helpful to see if there are any changes in the trend or shifts in the seasonal patterns. Such changes in patterns may be brought about by some triggering event (which may have a persistent effect on the time series). For such a descriptive analysis of a time series, it is useful to employ the following tools (a short sketch follows this list):
Historical plot: a plot with time (months, quarters or years) on the x axis and the time series values on the y axis;
Multiple time chart (for series observed monthly, quarterly or semestrally): superimposed plots of the series values against time, with each year represented by one line graph;
Table of summary statistics by periods, e.g., months.
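For a monthly series such as ts, the first and third of these tools might be produced along the following lines (a minimal sketch; the variable month_no is introduced here purely for illustration and assumes the series starts in January):
* historical plot of the series
graph twoway line ts n
* month-of-year indicator (1 = January), assuming the series starts in January
gen month_no = mod(n-1,12) + 1
* table of summary statistics by month
tabstat ts, by(month_no) statistics(mean sd min max)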
Denote a time series by $Y_t$, observed at times t = 1, 2, ..., n. Note that, in practice, the series to be analyzed might be a transformation of an original time series, say a log transformation, taken in order to stabilize the variance. A key feature of time series data (in contrast with other data) is that the data are observed in sequence: at time t, the observations of previous times 1, 2, ..., t-1 are known, but not those of the future. Also, because of autocorrelation, we can exploit information in the current and past values to explain or predict future values. Time series analysis often involves lagged variables, i.e., values of the same variable but from previous times, and leads, i.e., forward values of the series. That is, for a given time series $Y_t$, you may be interested in the first order lag
$LY_t = Y_{t-1}$
representing the original series but one period behind. To get a first order lag of the time series ts in STATA, enter either
gen tsl1=ts[_n-1]
or, alternatively,
gen tsl1=L.ts
in the command window. To get the lag 2 (also called the second order lag) variable, enter
gen tsl2=L2.ts
Now, to see the original time series together with its first and second order lags:
list n ts tsl1 tsl2
Likewise, we may be interested in the first order lead
$FY_t = Y_{t+1}$
of a time series $Y_t$, representing the original series but one period ahead. To get a first order lead of the time series ts in STATA, enter either
gen tsf1=ts[_n+1]
or, alternatively,
gen tsf1=F.ts
Sometimes, you may be interested in the first order difference
$\Delta Y_t = Y_t - Y_{t-1} = (1 - L)Y_t$
of a given time series $Y_t$. To get the first order difference of the time series ts in STATA, enter
gen tsd1=D.ts
while the second order difference
$\Delta^2 Y_t = \Delta Y_t - \Delta Y_{t-1} = (1 - L)^2 Y_t$
is obtained with
gen tsd2=D2.ts
Since ts is a monthly series, the seasonal difference is obtained with:
gen tss12=S12.ts
Note that this is not a 12th-order difference but rather a first difference at lag 12. This would yield
differences, say, between the Dec 2003 and Dec 2002 values, the Nov 2003 and Nov 2002 values, and so forth. Recall that a distinguishing characteristic of a time series is that it is autocorrelated. The autocorrelation coefficients are the correlation coefficients between a variable and itself at particular lags. For example, the first-order autocorrelation is the correlation of $Y_t$ and $Y_{t-1}$; the second-order autocorrelation is the correlation of $Y_t$ and $Y_{t-2}$; and so forth. A correlogram is a graph of the autocorrelations (at various lags) against the lags. With STATA, we can generate the correlogram with the corrgram command. For instance, for the ts time series, you can get the correlogram with:
corrgram ts, lags(80)
This yields the autocorrelations, the partial autocorrelations and the Box-Pierce Q statistics for testing the hypothesis that all autocorrelations up to and including each lag are zero. Partial autocorrelation is an extension of autocorrelation in which the dependence on the intermediate elements (those within the lag) is removed. For the Box-Pierce Q statistics, small p-values indicate significant autocorrelations. A more refined correlogram can be generated with:
ac ts, lags(80)
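The partial autocorrelations can be graphed in the same way; pac is the graphical counterpart of ac:
* graph the partial autocorrelations of ts up to lag 80
pac ts, lags(80)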
An autoregressive moving average model of order (p, q), ARMA(p,q), of the form
$y_t - \mu = \phi_1 (y_{t-1} - \mu) + \phi_2 (y_{t-2} - \mu) + \cdots + \phi_p (y_{t-p} - \mu) + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$
can be fit to a time series. ARMA(p,q) processes should be stationary, i.e., there should be no trend, and the series should have a constant long-run mean and a constant variance. To help identify a tentative ARMA model to fit to a stationary series, note the following characterizations of special ARMA models:
AR(1): ACF has an exponential decay; PACF has a spike at lag 1 and no correlation at other lags.
AR(2): ACF shows a sine-wave pattern or a set of exponential decays; PACF has spikes at lags 1 and 2 and no correlation at other lags.
MA(1): ACF has a spike at lag 1 and no correlation at other lags; PACF damps out exponentially.
MA(2): ACF has spikes at lags 1 and 2 and no correlation at other lags; PACF shows a sine-wave pattern or a set of exponential decays.
ARMA(1,1): ACF decays exponentially starting at lag 1; PACF also decays exponentially starting at lag 1.
In other words, we could summarize the signatures of AR, MA, and ARMA models on the autocorrelation and partial autocorrelation plots with the following table:
Model        ACF                    PACF
AR(p)        Decay                  Cutoff after lag p
MA(q)        Cutoff after lag q     Decay
ARMA(p,q)    Decay                  Decay
This summary table can be helpful in identifying candidate models. If we find three spikes on the partial autocorrelation plot (which may also look like a decay), then we could try to fit either an ARMA model or an AR(3) model. For instance, the differenced series of ts shows no trend, but the variability is smaller for the first half of the series than for the second half:
graph twoway line D.ts n
graph twoway line D.ts n if n>220
so we instead analyze only the second half of the series. There are decays in both the autocorrelation and the partial autocorrelation, so we may want to use an ARMA(1,1) on the differenced series:
ac D.ts if n>220, lags(40)
pac D.ts if n>220, lags(40)
arima D.ts, arima(1,0,1)
or equivalently, use an ARIMA(1,1,1) on the original ts series:
arima ts, arima(1,1,1)
Alternatively, you can use the menu-bar: Statistics > Time Series > ARIMA models. The underlying idea in Box-Jenkins models is that only past observations on the variable being investigated are used to describe the behavior of the time series. We suppose that the time-sequenced observations in a data series may be statistically dependent, and the statistical concept of correlation is used to measure the relationships between observations within the series. Often we would like to smooth a time series, breaking down the data into a smooth and a rough component: Data = smooth + rough. Then, we may want to remove the obscurities and subject the smooth series to trend analysis. In some situations, the rough parts are largely due to seasonal effects, and removal of the seasonal effects (also called deseasonalization) may help in obtaining trends. For instance, the time series shown in Figure 4-42, representing deseasonalized quarterly Gross Domestic Product (GDP) and Gross National Product (GNP) in the Philippines from the first quarter of 1990 to the first quarter of 2000, show a steady increase in the economy except for the slump due to the Asian financial crisis. The effect on GNP, though, is not as severe as that on GDP because of income transfers from overseas.
Figure 4-42. Gross Domestic Product (i) and Gross National Product (ii), first quarter 1990 to first quarter 2000.
Smoothing involves some form of local averaging of the data in order to filter out the noise. Smoothing methods can also provide a scheme for short-term forecasting. The smoothing done above is rather elaborate; we discuss here simpler smoothing techniques (than seasonal adjustment methods). The most common smoothing technique is the moving average smoother, also called running averages. This smoother derives a new time series from the simple averages of the original time series within some "window" or band. The result depends on the choice of the length of the period (or window) used for computing the means. For instance, a three-year moving average is obtained by first averaging the first three years, then the second to the fourth years, then the third to the fifth years, and so on. Consider the ts time series from the dole.xls data set. To obtain a 7-month moving average, use the tssmooth command, specify that it is a moving average with all seven points given the same weight, and then graph it:
tssmooth ma ts7=ts, window(3 1 3)
graph twoway line ts7 n, clwidth(thick) || line ts n, clwidth(thin) clpattern(solid)
This results in Figure 4-43, which shows the rough parts of the original time series being removed through the moving average smoothing.
Figure 4-43. Original and smooth versions of ts. An alternative to entering the commands above is to select from the Menu-bar
Statistics > Time Series > Smoothers > Moving Average Filter, and then enter in the resulting pop-up window the information shown in Figure 4-44: the name of the new variable (ts7), the original variable (or expression to smooth) ts, and equal weights for the current (middle) observation and the lagged and lead portions of the series.
Figure 4-44. Pop-up window for performing a moving average smoother. An alternative to moving averages (which give equal weight to all the data in a window) is exponential smoothing, which gives the most weight to the most recent observation. Simple exponential smoothing uses weights that decline exponentially; the smoothing parameter $\alpha$ drives the iterations for the smoothed data:
$S_t = \alpha X_t + (1 - \alpha) S_{t-1}$
based on the original time series $X_t$. You can ask STATA to obtain an exponentially smoothed series from the ts time series with
tssmooth exponential tse = ts
or by selecting from the menu-bar: Statistics > Time Series > Smoothers > Simple exponential smoothing. This will generate a pop-up window; replies to the window are shown in Figure 4-45.
Figure 4-45. Pop-up window for exponential smoothing. The default option here is to choose the best value of the smoothing parameter based on the smallest sum of squared forecast errors. Exponential smoothing was originally developed by Brown and Holt for forecasting the demand for spare parts. The simple exponential smoother assumes no trend and no seasonality. Other exponential smoothers implemented in STATA include:
double exponential smoothing;
the Holt model (with trend but no seasonality);
the Winters models (which assume a trend and a multiplicative seasonality);
nonlinear filters.
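As a rough sketch of these alternatives (not from the original text; the new variable names and the smoothing parameter 0.3 are arbitrary illustrations):
* simple exponential smoothing with a fixed smoothing parameter
tssmooth exponential tse2 = ts, parms(0.3)
* double exponential smoothing
tssmooth dexponential tsde = ts
* Holt's linear trend method (Holt-Winters without seasonality)
tssmooth hwinters tshw = ts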
Smoothing techniques are also helpful in generating short-term forecasts. Sometimes our goal in time series analysis is to look for patterns in smoothed plots. In other instances, the rough part, or residuals, is of more interest:
gen rough= ts-tse
label var rough "Residuals from exp smoothing"
graph twoway line rough n
The last command above generates the time plot shown in Figure 4-46.
Figure 4-46. Time plot of residuals. Typically, the goal in time series analysis involves yielding forecasts, and smoothing techniques can help yield short-term forecasts. A rather simple forecasting method entails using the growth rate:
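As a rough illustrative sketch of a growth-rate forecast (the variable names gr and gforecast are hypothetical, and the one-step-ahead extrapolation shown here is only one possible choice):
* month-to-month growth rate of ts
gen gr = ts/L.ts - 1
* naive forecast: extrapolate the previous value by the previous period's growth rate
gen gforecast = L.ts*(1 + L.gr)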
We can alternatively use trend models that treat time as an explanatory variable. You can fit a linear trend model with
reg ts n if n>220
predict forecast3 if n>220
To account for nonlinearity in the trend, you can try a quadratic trend model:
gen nsq=n^2 if n>220
reg ts n nsq if n>220
predict forecast4 if n>220
If you believe that there is a strong seasonal effect that can be accounted for by monthly additive effects in addition to the quadratic trend, you can run the following:
gen mnth=mod(n,12)
tab mnth, gen(month)
reg ts n nsq month1-month12 if n>220
predict forecast5 if n>220
Assessment of the forecasts may be done by inspecting the within-sample mean absolute percentage error (MAPE) of the forecasts:
gen pe1=100*abs(ts-forecast1)/ts
gen pe2=100*abs(ts-forecast2)/ts
gen pe3=100*abs(ts-forecast3)/ts
gen pe4=100*abs(ts-forecast4)/ts
gen pe5=100*abs(ts-forecast5)/ts
mean pe1 pe2 pe3 pe4 pe5
which suggests that the forecasts using month-to-month growth rates give the best forecasting performance. In addition to past values of a time series and past errors, we can also model the time series using the current and past values of other time series, called input series. Several different names are used to describe ARIMA models with input series: transfer function model, intervention model, interrupted time series model, regression model with ARMA errors, and ARIMAX model are all different names for ARIMA models with input series. Here, we consider a general structural model relating a dependent variable y to covariates X and a disturbance:
$y_t = X_t \beta + \mu_t$
where we suppose that the disturbance $\mu_t$ can itself be modeled as an ARIMA process. You can still use the arima command for this model. How can you tell whether it might be helpful to add a regressor to an ARIMA model? After fitting an ARIMA model, you should save its residuals and then look at their cross-correlations with other potential explanatory variables. If we have two time series, we can also explore relationships between the series with the cross-correlogram command xcorr. When modeling, you must be guided by the principle of parsimony: that is, you ought to prefer a simple model over a complex one (all things being equal). We must also realize that the goodness of fit often depends on the specification, i.e., what variables are used as explanatory variables, what functional form is used for the variable being predicted, etc. Thus, what is crucial is to work out an acceptable framework for analysis before attempting to use any statistical models to explain and/or predict some variable.
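As a brief sketch of these two commands (here x stands for a hypothetical input series that is not part of the dole data set, and the ARIMA(1,1,1) disturbance structure is simply carried over from the earlier example):
* cross-correlogram between ts and a candidate input series
xcorr ts x, lags(20)
* regression of ts on x with ARIMA(1,1,1) disturbances (an ARIMAX-type model)
arima ts x, arima(1,1,1)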
The research report ought to explain concisely and clearly what was done, why it was done, how it was done, what was discovered, and what conclusions can be drawn from the research results. It should introduce the research topic and emphasize why the research is important. The report should describe the data, as well as explain the research methods and tools used; it ought to give enough detail for other researchers to repeat the research. Simple examples ought to be given to explain any complex methodologies that may have been used. The clarity of the report rests on composition, reasoning, grammar and vocabulary. Booth et al. (1995) point out that writing is a way not to report what is in that pile of notes, but to discover what you can make of it; they suggest that, regardless of the complexity of the task of coming up with the research report, the plan for a draft report should consist of (a) a picture of who your readers are, including what they expect, what they know, and what opinions they have; (b) a sense of the character you would like to project to your readers; (c) the research objectives (written in such a way as to also suggest gaps in the body of knowledge
you are addressing) and the significance of the study; (d) your research hypotheses, claims, points and sub-points; and (e) the parts of the research paper. When should the report be started? Some researchers prefer to come up with some major results before commencing the writing process; others start immediately after the research proposal is finalized. Starting the writing process can be quite frightening, even to most researchers. The laws of physics suggest that when a body is at rest, it takes force to move it. Similarly, it takes some energy to begin writing up a research report, but once it starts there is also some inertia that keeps the writing moving, unless we get distracted. There are different ways of getting started in writing and in continuing to write. Booth et al. (1995) suggest the importance of developing a routine for writing: everyone has a certain time of day or days of the week when they are at their most creative and productive, so you ought to schedule the creative phase of your writing for these times. Other times may be more suitable for detailed work such as checking spelling and details of the argument. Microsoft Word can also be used for spell checking, but beware that a correctly spelled word is not necessarily the right word to use. Booth et al. (1995) also stress the need to avoid plagiarism, which they define as follows: you plagiarize when, intentionally or not, you use someone else's words or ideas but fail to credit that person. You plagiarize even when you do credit the author but use his (or her) exact words without so indicating with quotation marks or block indentation. You also plagiarize when you use words so close to those in your source that, if you placed your work next to the source, you would see that you could not have written what you did without the source at your elbow. When accused of plagiarism, some writers claim: "I must have somehow memorized the passage. When I wrote it, I certainly thought it was my own." That excuse convinces very few. Some researchers make great progress in writing up the report with the aid of a structured technique. In this case, an outline is developed that identifies the overall structure of the draft research report in a hierarchical manner, working down to the specifics of the report. Some people, however, find that outlining stifles their creativity. An advantage of outlining is that the pieces of the report are conceived before the report is written, and this may help give focus to drafting it.
Microsoft Word, currently the most popular word processing package, supports outlining. Merely click on View, then select Outline. The Outline View displays the Word document in outline form. It uses heading styles to organize the outline; headings can
be displayed without the text. If you move a heading, the accompanying text moves with it. In outline view, you can look at the structure of the document as well as move, copy, and reorganize text in the document by dragging headings. You can also collapse a document to see only the main headings, or expand it to see all headings and even the body text.
Typically, a research report would carry the following general topic outline:
1. Introduction
2. Review of Related Literature
3. Data & Methodology
4. Results & Discussion
5. Summary, Conclusions & Recommendations
Such a skeleton may help in the early stage of a researcher's thinking process. These topics could be further subdivided into sub-topics, sub-sub-topics, and so forth. Elements of the research proposal, such as the objectives, hypotheses, conceptual and operational framework, and relevance of the research, may comprise the introduction section of the draft report, together with some background on the study. That is, the first section on Introduction could contain:
1.1. Background
1.2. Research Objectives
1.3. Research Hypotheses
1.4. Conceptual and Operational Framework
1.5. Significance of Study
The third section on Data & Methodology might include a description of the data and a discussion of the data analysis method, including its appropriateness for arriving at a solution to test the operational hypothesis, so the section could be divided as follows:
3.1. Introduction
3.2. Data
3.2.1. Sampling design
3.2.2. Data processing
3.2.3. Data limitations
3.3. Data Analysis Methods
Data can challenge the theory that guided their collection, but any analysis of poorly generated data will not be valid; thus the manner in which the data were collected and processed has to be described. Statistical models help in distinguishing genuine patterns from random variation, but various models and methods can be used. It is important to discuss the analytical methods employed in order to see the extent of validity of the research results.
When the research requires more synthesis, interpretation and analysis, a researcher may not necessarily have a clear sense of the results. The problem may not even be too clear, and in this case the act of drafting will itself help in the analysis, interpretation and evaluation. There may be many moments of uncertainty and confusion. A review of the research questions and identification of key points may help put a structure to the report. It may also be important to combine a topic outline with a point-based outline that identifies points within each topic. Note that there may be more than one way to arrange the elements and points of the report, but preferably go from short and simple to long and more complex, and from more familiar to less familiar. Consider the following example of an outline of a research report that analyzes results of a survey conducted by the Philippine National Statistics Office:
Research Topic/Title: Contraceptive Use, Discontinuation and Switching Behavior in the Philippines: Results from the 2003 National Demographic and Health Survey
Researchers: Socorro Abejo, Elizabeth Go, Grace Cruz, Maria Paz Marquez
Outline
I. Introduction:
a. Increasing trends in contraceptive prevalence rates attributed to increasing use of modern methods
b. CPR unchanged over the last five years (47% in 1998; 49% in 2003); one in four births mistimed
c. some discontinuation of use of contraceptives due to method failure
II. Significance of Study: understanding contraceptive dynamics an important input for family planning advocacy and information/education campaigns
III. Objectives of Study:
a. To determine the level and determinants of contraceptive method choice of women by selected background variables
b. To determine twelve-month discontinuation rates and median duration of use by specific methods and by selected background variables
c. To determine twelve-month switching rates by specific methods and by selected background variables; and
d. To determine the level and determinants of contraceptive use of men during the last sexual activity by selected background variables.
IV. Conceptual Framework: largely from Bulatao (1989)
V. Data and Methodology
a. Data Source: NDHS (mention support of ORC Macro); sampling design; only some sections to be analyzed; calendar data file
b. Unit of analysis: currently married who were not pregnant at time of interview or unsure about pregnancy
c. Background Variables: basic socio-econ and demographic characteristics
d. Methodology: descriptive analysis (frequencies, cross-tabs); multinomial logistic regression and multiple classification analysis; life table technique
VI. Results
a. Contraceptive Practices of Women:
i. trends in contraceptive use
ii. differentials in contraceptive prevalence
iii. determinants of contraceptive use
iv. contraceptive discontinuation
v. contraceptive switching behavior
b. Contraceptive Practices of Men
i. male sexuality
ii. knowledge of, attitude toward and ever use of contraception
iii. contraceptive prevalence during last sex
iv. differentials in method choice
v. differentials in type of method
vi. multinomial logit regression of male contraceptive use
vii. characteristics of male method users vs non-users
VII. Conclusions and recommendations:
a. contraceptive practices among women associated with demographic factors; use is higher among older women, among women who are in a legal union, and among women who desire a smaller number of children; higher among women with higher socio-economic status
b. positive effect of education on contraceptive use
c. need for special interventions to make FP supplies and services more accessible to poor couples
d. degree of communication between woman and partner related to contraceptive practices
e. poor attitude toward contraceptive use among men
f. extra-marital practices or ex-nuptial relationships expose people to risk; need for strengthening values
g. differentials in contraceptive use and withdrawals could help FP managers in properly designing programs and strategies
h. some problems with information; services, especially for education, need improvement
Such an outline is helpful, as it goes beyond topics and provides thoughts and relations among claims. In addition, interrelations among the parts of the research report are conceptualized here, especially in coming up with points, claims and the organization of the arguments. Also, it is easy to observe where issues need connectives that will bring together parts of the text and reveal their relations. After coming up with the outline, it is time to go to the level of the paragraph. The boundaries between a paragraph and a sub-section are not sharp. Sub-sections allow you to go deeper into a topic, and require several related paragraphs. Within each sub-section, you may
want to use a topic sentence to define each paragraph. The idea is to write a single sentence that introduces the paragraph, and leave the details of that paragraph for later. This expanded form of the outline may constitute the skeleton of the initial draft of the report. Outlining need not only be part of the pre-drafting stage. Sometimes, one comes up with a long draft report based on a general outline, and then an outline like the one above may help further improve the draft by cutting out much of what was written (or even throwing away the entire draft). There may even be several iterations of this drafting and outlining before the research report is finalized. The point is to recognize that there will be changes in the outline, but it may be much more satisfactory and productive to modify a well-considered outline than to start without any plan whatsoever.
Direct, unequivocal and persuasive writing is important in scientific and technical reports. A good communicator must always understand that communication involves three components: the communicator, the message and the audience. Thus a good criterion for selecting a writing style is for the writer to think of the intended reader. The writer is challenged to keep it short and simple (KISS). Conciseness should, however, not mean that clarity is sacrificed. Booth et al. (1995) observe that: Beginning researchers often have problems organizing a first draft because they are learning how to write and, at the same time, they are discovering what to write. As a consequence, they often lose their way and grasp at any principle of organization that seems safe. Claims will have to be supported and communicated not merely textually, but also with visual aids, such as tables, graphs, charts and diagrams. Visuals facilitate understanding of the research and the results; they can convince readers of your point, as well as help you discover patterns and relationships that you may have missed. Knowing what type of visual to use in a report is crucial. You must keep the end result (the research report) and the key point
of the research in focus. Depending on the nature of the data, some visuals might be more appropriate than others in communicating a point. Be aware that a visual is not always the best tool to communicate information; sometimes textual descriptions can provide a better explanation to the readers. Visuals must also be labeled appropriately: the title of a figure is put below the figure, while the title of a table is put above the table. These visuals must be located as close as possible to the textual description, which must make proper reference to them. Readers ought to be told explicitly what you want them to see in a table or figure. Whether the draft (and the final report) will contain visual or textual evidence (or both) will depend on how readers can understand information and how you want your readers to respond to the information you present. Beware that most software packages, especially spreadsheets, may generate visuals that are aesthetically pleasing but do not necessarily communicate good information. For instance, pie charts, though favorites in non-technical print media, are generally not recommended for use in research reports. Some psychological research into perceptions of graphs (see, e.g., Cleveland and McGill, 1984) suggests that pie charts are among the weakest displays; they are quite crude in presenting information, especially for more than four segments. Thus, it is recommended that pie charts be avoided in research reports, although they may be used in presentations if you wish to have your audience see a few rough comparisons where differences are rather distinctive. Booth et al. (1995) suggest that a researcher ask the following questions when using visual aids:
How precise will your readers want the data to be? Tables are more precise than charts and graphs (but may not carry the same impact).
What kind of rhetorical and visual impact do you want readers to feel? Charts and graphs are more visually striking than tables; they communicate ideas with more clarity and efficiency. Charts suggest comparisons; graphs indicate a story.
Do you want readers to see a point in the data? Tables encourage data interpretation; charts and graphs make a point more directly.
According to Booth et al. (1995), some people write as fast as possible and correct later, i.e., the quick and dirty way; others write carefully without leaving any problems, i.e., the slow and clean way; and others, perhaps most people, are somewhere in between these extremes. In the latter case, it may be important to keep notes of things that can be checked later. Booth et al. (1995) also provide five suggestions that a researcher may use in drafting a report, viz:
Determine where to locate your point: express the main claim, especially in the last sentence of the introduction or in the conclusion. The same guideline ought to be used for major sections and sub-sections of the report.
Formulate a working introduction: the least useful working introduction announces only a topic; it is important to provide a background/context and, if you can, state the problem and even an idea of the solution. Most readers will not bother to read the report if the introduction is not well stated.
Determine Necessary Background, Definitions, Conditions: decide what readers must know, understand or believe; spell out the problem in more detail by defining terms, reviewing literature, establishing warrants, identifying scope and delimitations, and locating the problem in a larger context. But this summary MUST NOT dominate the paper.
Rework Your Outline: rearrange elements of the body of an argument to make your points more organized: review what is familiar to readers, then move to the unfamiliar; start with short, simple material before getting into long, more complex material; start with uncontested issues and move to more contested issues. Note that these suggestions may pull against each other.
Select and Shape Your Material: research is like gold mining: a lot of raw material may be dug, a little is picked out, and the rest is discarded. You know you have constructed a convincing argument when you find yourself discarding material that looks good, but not as good as what you keep. Ensure that your presentation of material highlights key points, such as patterns evident from your data.
The overall plan is to have a research paper with a flow in its structure involving its various sections, viz., the introduction, data and methods section, results and discussions, conclusions and recommendations. The section on results and discussions is the main portion of the research report. It ought to contain answers to the research questions, as well as the requisite support and defense of these answers with arguments and points. Conflicting results, unexpected findings and discrepancies with other literature ought to be explained. The
importance of the results should be stated, as well as the limitations of the research undertaking. Directions for further research may also be identified. A good research report transfers original knowledge on a research topic; it is clear, coherent, focused, well argued and uses language without ambiguities. The research report must have a well-defined structure and function, so that other researchers can repeat the research. In achieving this final form of the research report, you will first have to come up with the draft. The first draft of the technical report need not necessarily use correct English or a consistent style. These matters, including correct spelling, grammar and style, can be worked out in succeeding drafts, especially the final draft. One must merely start working on the draft and keep moving until the draft is completed. It may even be possible that what is written in a draft may not end up in the final form of the report. Thus, it is important to start drafting as soon as possible, and it is generally helpful not to be too much of a perfectionist on the first draft. Booth et al. (1995) suggest that some details in the draft be left for revisions: if you stumble over a sentence, mark it but keep moving. If paragraphs sound disconnected, add a transition if one comes quickly, or mark it for later. If points don't seem to follow, note where you became aware of the problem and move on. Unless you are a compulsive editor, do not bother getting every sentence perfect, every word right. You may be making so many changes down the road that at this stage there is no point wasting time on small matters of style, unless perhaps you are revising as a way to help you think more clearly. Once you have a clean copy with the problems flagged, you have a revisable draft.
Printing a hard copy of the draft may be useful in revising it, as a hard copy will enable you to more easily see mistakes in spelling, punctuation, and grammar that can be difficult to catch in a word processor. Revisions involve some level of planning and diagnosing. The process involves identifying the frame of the report: the introduction and
conclusion (each of which should carry a sentence that states the main claim and the solution to the problem), as well as the main sections of the body of the paper, including the beginning and final parts of each of these sections. In addition, it is important to analyze the continuity of the themes in the paper, as well as the overall shape and structure of the paper. Repetition of words in the same paragraph, especially in consecutive sentences, generally ought to be avoided. Verbosity, i.e., wordiness, and the use of redundant words also ought to be avoided. Arguments have to be sound: it is one thing to establish correlation; it is another to establish causality. The variables investigated may be driven by other variables that confound the correlation between the investigated variables. Remove excessively detailed technical information and the details of computer output, and put them instead in an appendix. It may be helpful to read and re-read your draft as if it were written by someone else. The key is to put oneself into the shoes of a reader, i.e., to see the draft from and through the eyes of a reader, imagining how they will understand it, what they will object to, and what they need to know early so they can understand something later (Booth et al., 1995). This will entail diagnosing mistakes in organization, style, grammar, argumentation, etc., as well as revising the text to make it more readable. It may help to ask yourself a few questions, such as:
Is the introduction captivating (to a reader)?
Are the text structure, arguments, grammar and vocabulary used in the draft clear?
Does the text communicate to its readers what you want it to? That is, can the reader find what you wanted to say in the draft?
Are the visuals, e.g., graphs, charts, and diagrams, communicating the story of the research effectively?
Does the draft read smoothly? Are there connectors among the various ideas and paragraphs? Is there a flow to the thoughts? Are they coherent?
Is the draft as concise as possible? Are there redundant thoughts? Can the text be shortened?
Does the draft read like plain English?
Is there consistency between the introduction and conclusion? Were the research objectives identified actually met?
Booth et al. (1995, pp. 232-233) provide some concrete and quick suggestions for coming up with revisions to the writing style in the draft. They are presented below in toto:
If you don't have time to scrutinize every sentence, start with passages where you remember having the hardest time explaining your ideas. Whenever you struggle with content, you are likely to tangle up your prose as well. With mature writers that tangle usually reflects itself in a too complex, nominalized style.
FOR CLARITY:
Diagnose
1. Quickly underline the first five or six words in every sentence. Ignore short introductory phrases such as "At first," "For the most part," etc.
2. Now run your eye down the page, looking only at the sequence of underlines to see whether they pick out a consistent set of related words. The words that begin a series of sentences need not be identical, but they should name people or concepts that your readers will see are clearly related. If not, you must revise.
Revise
1. Identify your main characters, real or conceptual. They will be the set of named concepts that appear most often in a passage. Make them the subject of verbs.
2. Look for words ending in -tion, -ment, -ence, etc. If they appear at the beginning of sentences, turn them into verbs.
FOR EMPHASIS:
Diagnosis
1. Underline the last three or four words in every sentence.
2. In each sentence, identify the words that communicate the newest, the most complex, the most rhetorically emphatic information; technical-sounding words that you are using for the first time; or concepts that the next several sentences will develop.
Revise
Revise your sentences so that those words come last.
In looking through the draft, you ought to avoid colloquialisms and jargon; pay attention to using proper, unambiguous words. Try also to use a dictionary/thesaurus, especially if there is one in your word processor; this may help in remedying the repetition of words in a paragraph. Spell-checkers also ought to be used, but be careful: these may not always yield the correct words you want. When introducing technical and key terms in the research report, it is preferable to structure the sentence so that the term appears among the last words. The same may be true of a complex set of ideas, which ought to be made as readable as possible. Non-native writers of English have the tendency to write in their native language and then translate their thoughts into English. This tends to be too much work unless only notes are made rather than full sentences and texts. Even native speakers and writers of English are not spared problems in communication, given the rich grammatical constructions of English. Grammar essentially involves linking, listing and nesting words together to communicate ideas. Sentences consist of coordinated words, and clauses embedded and glued together. Booth et al. (1995) stress the importance of grammar in a research report: "Readers will judge your sentences to be clear and readable to the degree that you can make the subjects of your verbs name the main characters in your story." To assist in the revision of the structure and style of the draft research report, we provide in the Appendix of this manual a review of a few grammar concepts and principles.
The executive summary, introduction, and conclusion ought to be clear, logical, coherent, and focused. The executive summary should be coupled with good arguments and well-structured writing to enable the paper to get widely read, and possibly published. The introduction and the conclusion likewise ought to be well written as well as coherent in content. The key points stressed in the introduction should not conflict with those in the conclusion. The introduction may promise that a solution will be presented in the concluding section. Because of the importance of the
introduction and conclusion, as well as the executive summary, some writers prefer to write these last. Even in this case, a working introduction and working conclusion will still have to be initially drafted.
It is suggested that the title contain only seven to ten words and avoid complex grammar. Preferably, the title should intrigue readers and attract interest and attention. Booth et al. (1995) suggest that the title be the last thing you write: "a title can be more useful if it creates the right expectations, deadly if it doesn't... Your title should introduce your key concepts... If your point sentence is vague, you are likely to end up with a vague title. If so, you will have failed twice: you will have offered readers both a useless title and useless point sentences. But you will also have discovered something more important: your paper needs more work."
They also provide an example of an introduction that states the context, the research problem and a sense of the outcome: "First born middle-class native Caucasian males are said to earn more, stay employed longer and report more job satisfaction.5 But no studies have looked at recent immigrants to find out whether they repeat that pattern. If it doesn't hold, we have to understand whether another does, why it is different, and what its effects are, because only then can we understand patterns of success and failure in ethnic communities.6 The predicted connection between success and birth order seems to cut across ethnic groups, particularly those from South-east Asia. But there are complications in regard to different ethnic groups, how long a family has been here, and their economic level before they came.7" The following example is taken from an introductory section of a draft working paper by Nimfa Ogena and Aurora Reolalas on Poverty, Fertility and Family Planning in the Philippines: Effects and Counterfactual Simulations: With the politicization of fertility in many countries of the world, the impact of population research has expanded beyond mere demographics towards the wider socio-cultural and political realm. The raging debates during the past year in various parts of the country on the population and development nexus, at the macro level, and why fertility continues to be high and its attendant consequences, at the individual and household levels, created higher visibility for this ever important issue and legitimized certain influential groups. Nevertheless, with the dearth of demographic research in the country over the past decade, debates have very limited findings and empirical evidence to draw from to substantiate basic arguments. Correlations are not sufficient basis for arguing that fertility induces poverty or poverty creates conditions for higher fertility. Better specified models are needed to examine causal linkages between fertility and poverty. This study aims to: (1) analyze regional trends, patterns and differentials in fertility and poverty; (2) identify factors influencing Filipino women's poverty status through their recent fertility and contraceptive protection; and (3) illustrate changes in the expected fertility and poverty status of women when policy-related variables are modified, based on fitted structural equations models (SEM). Expected to be clarified in this study are many enduring questions such as: Does having a child alter a woman's household economic status? How does the poverty situation at the local level impact on a woman's fertility and household economic status?
5 Shows context.
6 States the research problem.
7 Indicates a sense of the outcome.
As fertility declines, how much reduction in the poverty incidence is expected? If unmet need for family planning is fully addressed, by how much would fertility fall?
Methodologically, this is the first time that the fertility-poverty status linkage is examined at the individual level with a carefully specified recursive structural equations model (SEM) that accounts for the required elements to infer causality in the observed relationships. This paper also hopes to contribute to current policy debates by providing different scenarios to illustrate possible shifts in womens fertility and poverty status as selected policy-related variables are modified. Notice that the authors make available the background of the research problem, the research objectives and questions to be tackled, and the significance of the study.
The concluding section may offer a summary of the research findings. It ought to show that the key objectives were met and the research questions answered. All conclusions stated ought to be based on research findings; some suggestions for further research may be stated, recognizing what is still not known. A closing quote or fact may be presented. Regardless of whether we do so or not, the conclusion should be in sync with the introductory section.
Booth et al. (1995, pp. 250-254) offer some quick dos and don'ts in the choice and use of first and last words:
Your FIRST FEW WORDS:
Don't start with a dictionary entry: "Webster defines ethics as..."
Don't start grandly: "The most profound philosophers have for centuries wrestled with the important question of..."
Avoid "This paper will examine..." or "I will compare..." Some published papers begin like that, but most readers find it banal.
Three choices for your first sentence or two:
Open with a striking fact or quotation, but only if its language leads naturally into the language of the rest of your introduction.
Open with a relevant anecdote, but only if its language or content connects to your topic.
Open with a general statement followed by more specific ones until you reach your problem.
If you open with any of these devices, be sure to use language that leads to your context, your problem and the gist of your solution.
Your LAST FEW WORDS: Not every research paper has a section titled "Conclusion," but they all have a paragraph or two to wrap them up.
Close with your main point, especially if you ended your introduction not with your main point but with a launching point. If you ended your introduction with your main point, restate it more fully in your conclusion.
Close with a new significance or application, which could earlier have answered the question "So what?" but perhaps at a level more general than you wanted to aim at. If your research is not motivated directly by a practical problem in the world, you might ask now whether its solution has any application to one.
Close with a call for more research if the significance of your solution is especially interesting. Close with a coda, a rhetorical gesture that adds nothing substantive to your argument but rounds it off with a graceful close. A coda can be an apt quotation, anecdote, or just a striking figure of speech, similar to or even echoing your opening quotation or anecdote: one last way that introductions and conclusions speak to each other. Just as you opened with a kind of prelude, so can you close with a coda. In short, you can structure your conclusion as a mirror image of your introduction.
conclusions and recommendations. The executive summary is an expanded version of an abstract, which typically is not more than 100 to 150 words, containing the context, the problem, and a statement of the main point, including the gist of the procedures and methods used to achieve the research results.
The reality is that not all the people who attend a dissemination forum on your research will read your report. Thus, your work and your research report will likely be judged by the quality of your presentation. The primary purpose of a presentation is to educate, to provide information to the audience. If your presentation is poorly expressed, then your research results will be poorly understood. In that case, your research will be in danger of being ignored, and the effort you put into it may go to waste. Too many details will likely not be remembered by the audience, who might just go to sleep, or worse, snore during your presentation!
Unlike the written research report, a presentation is a one-shot attempt to make a point or a few points. Thus, it is vital that your presentation be well constructed and organized, with your points presented in a logical, clear sequence. Often, your presentation may have to be good for only ten to fifteen minutes. The shorter the talk, the more difficult it will generally be to cover all the matters you wish to talk about. You can't discuss everything, only some highlights, so you need to be rather strict about including only essential points in the presentation and removing all non-essential ones. Try to come up with a simple presentation. Of course, planning the content of a talk in relation to its actual length can be rather difficult. It may help to note that if you speak about 100 words per minute, each sentence covers about 15 words, and each point carries about 4 sentences, then a 15-minute talk will roughly entail 25 concepts (and 25 slides, at 1 concept per slide). If you are given more than 30 minutes for your presentation, you have more time to cover material, but you also have to find ways of enlivening your presentation, as people's attention is typically short-lived. In this case, there is also the danger that your audience may pay more attention to your assumptions and question them. Whether your talk is short or long, as a rule of thumb, people can only remember about five or fewer new things from a talk.
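If you want to sanity-check this arithmetic while planning your own talk, it can be run in a few lines of Stata (the software used elsewhere in this manual). The figures below are simply the rules of thumb just cited, so this is only a minimal sketch, not a fixed prescription.

* Rough slide budget for a 15-minute talk, using the rules of thumb above
local wpm = 100                       // words spoken per minute
local wds_per_sent = 15               // average words per sentence
local sent_per_point = 4              // sentences spent on each point (slide)
local minutes = 15                    // length of the talk
display "total words:      " `wpm'*`minutes'
display "total sentences:  " `wpm'*`minutes'/`wds_per_sent'
display "points (slides):  " `wpm'*`minutes'/(`wds_per_sent'*`sent_per_point')

Under these assumptions the budget works out to about 1,500 words, 100 sentences and 25 points, which is the figure cited above.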
It is suggested that, before your talk, you create an outline of the talk and plan how to say what you want to say. It should be clear to you whether you are expected to present new concepts to your audience or to build upon their knowledge. That is, have people not read your paper (in which case, how can you persuade them to do so?) or have they read it (in which case, what specific point do you wish them to appreciate?)? Either way, the basics of your talk (the topic, concepts and key results) ought to be delivered clearly, and early on in the talk, to avoid losing the attention of the audience. You ought to identify the key concepts and points you plan to make. Determine which of these concepts and points will require visuals, e.g., graphs, charts, tables, diagrams and photos that reinforce your message. Preferably, these visuals should be big, simple and clear. The intention of using such visuals is to provide insights and promote discussion. Although presenting data in tables can be effective, these should not be large tables; use only small tables with data pertinent to the point you want to deliver. Figures may often be more effective during presentations. All such visuals ought to be prepared well in advance.
If you will be presenting your report with the aid of computer-based presentation software, such as PowerPoint, make sure to invest some time in learning how to use it. PowerPoint is an excellent tool for organizing your presentation into slides and transparencies. Inspect the various templates, slide transitions and custom animations available. While colorful templates may have their uses, you ought to be prudent in your choice of templates, backgrounds and colors. Artwork, animation and slide transitions may improve the presentation, but don't go overboard. Too much animation and too many transitions may reduce the visuals' effectiveness and distract your audience from your message. Using cartoons excessively may make your presentation appear rather superficial. Thus, be artistic only in moderation. Do not lose sight of the fact that you have an overall message to deliver to your audience: the key point of your research. You are selling your research! Artwork will not substitute for content. The earlier you start on your slides, the better they will be, especially as you fine-tune them; however, avoid excessive fine-tuning. Be conscious that your slides are supports and guides for your talk, and coordinate what you say with their content. Your first or first two slides should give an overview
(or outline) of the talk. Your slides ought to be simple and informative, with only one basic point on each slide. If you intend to give a series of points, it may be helpful to organize them from the most to the least important; this way, your audience is more likely to remember the important points later. Establish logical transitions from one point to another that link these issues. You may, for instance, pose a question to introduce the next point. Make sure to put the title, location, and date of the talk on each slide, and run a spelling check on the words in your slides. It can be embarrassing if people pay more attention to spelling mistakes in the slides than to the issues you are attempting to communicate. Some standard don'ts in using PowerPoint presentations: Don't use small fonts. Make sure that your slides are readable. As a rule of thumb, if it looks just about right on the computer screen, then it is likely
to be too small during the presentation. If it looks big, it may still be small in slide view. Aim for outrageously large fonts (and that goes for figures, charts, etc.). To simulate this, put your slides on slide show and step back six feet from your computer; you ought to be able to read all the text in your presentation very easily, and if not, resize your fonts. Don't write too much on each slide, as the fonts get automatically smaller when there is too much text. Try to have only four or five lines per slide, with not more than six words per line, and preferably keep the smallest font used at around 36 points. Don't use pale colors, e.g., yellow (about one in ten people are said to be color blind, and these people cannot see yellow). Use highly saturated colors instead of pastels, and complementary colors for your text and background, to increase the visibility of the text on the slide. Color increases visual impact. Don't use only upper case (except for the presentation title). A mixture of lower and upper case tends to be easier to read than purely upper or purely lower case.
Don't use complete sentences. It is better to use only key phrases, with bullets. If you do use sentences, keep them short and simply constructed.
Don't overcrowd slides: do not, for example, use very large tables copied directly from an MS Word document, and don't put too many figures or charts on one slide.
Don't change formats. Once you have made your choice of colors, fonts, font sizes, etc., stick to them. Consistency in format keeps your audience from being distracted by anything other than your message.
Don't display slides too quickly, but don't spend too much time on one slide either. The audience can scan the content of a slide (including visuals) within the first three seconds after it appears. If you don't say anything during this period, you allow your audience to absorb the information; then, when you have their attention, you can expand upon what the slide has to say.
Don't ever read everything directly off your slides. The audience can read! If you were to read everything directly from your slides, then people could just be given copies of the slides without you needing to give the presentation. You need to say some things not stated in your slides, e.g., elaborations of your points, anecdotes, and the like. You ought to maintain eye contact with your audience, preferably almost always.
Finally, stage your presentation: run through your talk at least once. A practice talk is likely to be about 25% faster than the actual presentation. Re-think the sequencing of issues and reorganize them to make the talk run more evenly. Rephrase your statements as needed. Delete words, phrases, statements and issues that may be considered non-essential, bearing in mind time constraints as well as the flow of ideas in your talk. Try to leave some time for questions at the end of the talk. You may want to prepare a script or notes for your presentation to keep your slides in sync with what you want to say. The script or notes can be quite useful, especially if you go astray during the presentation; it puts order to what you will say, but it has to be well organized. Practice your presentation, perhaps first in private. Listen to the words you use and how you say them, not to what you may think you are saying. Rehearsing your talk in
front of a mirror (making eye contact with an imaginary audience) can be a painful and humbling experience, but it is also rather helpful, as you observe your idiosyncrasies and mannerisms, some of which may be distracting to your audience. After your private rehearsal, try your presentation out in front of some colleagues, preferably including some who do not know too much about your topic and research. These colleagues can provide constructive feedback on both the content and the style of the presentation. Make the changes. Rehearse some more (as the saying goes, practice makes perfect), perhaps around five times, then let it rest (and even sleep on it). The night before the talk, sleep well.
future work. Then, finally, end your talk with a few words of thanks. An acknowledgement slide may be helpful, either at the very end or at the front end of the talk. If you are interrupted during a conference or seminar, you can answer without delay, but don't lose control over the flow of your presentation. You ought to avoid being sidetracked or, even worse, being taken over by the distraction. You can opt to defer your reaction to the interruption by mentioning that it will be discussed later or at the end of the talk. If questions are raised during your presentation, it is good practice to repeat the question, not only for your benefit and that of the person raising it, but also for the sake of others. Take some time to reflect on the question. If there are questions about the assumptions of your research, you will have to answer them in detail; be prepared and anticipate the audience's reactions. If there are questions about a point you made, you will have to discuss this point and re-express it clearly. If the audience is largely composed of non-specialists, you may want to delay such a discussion until the end of your presentation or take it up privately. Technical questions will have to be given technical answers. If you don't understand the question raised, or if the question is rather challenging, you may honestly say so (but there is no need to apologize). Here, you may want to offer to do further research on the issue, ask for suggestions from the floor, or have further discussions after your presentation. If you sense that your answer and discussion is getting prolonged, make efforts to get out of the heat tactfully. If you run out of time in your presentation, you may finish your current point, refer the interested reader to further details in the research paper, and jump to the concluding section of your presentation.
I. Tense of a Verb:

Past: used for events that were already in the past when the text was written. Example: A calendar data file was created by extracting calendar information from the 2003 women data file using the Dynpak software package. The past tense is often used in the Methodology section of the research report. It is also used for a result that is specific to the research undertaking. (If the statement is a general statement of fact, the present tense is used.)
Present: used for statements that are always true, according to the researcher, for some continuing time period; the statement may have been false before, but it is true at the time of writing and for some time henceforth.
Example: The 2003 NDHS is the first ever National Demographic Survey that included male respondents. The present tense is also used for a result that is widely applicable, not just to the research at hand: Better educated, working women are more likely to have a higher level of contraceptive use. Past Perfect: used for events that were already in the past when another event in the past occurred. Future Perfect: used for future events that will have been completed when another event in the future occurs. II. Voice of a Verb: refers to whether the subject of the sentence performs the action expressed in the verb (the so-called active voice), e.g., A researcher presented his research proposal, or whether the
subject is acted upon (the so-called passive voice), i.e., the agent of the action appears in a phrase beginning with "by the". Example: The research proposal was presented by the researcher. In the passive voice, sometimes there is no explicit agent (no "by the" phrase), only an implied one. Note that in the last example, the phrase "by the researcher" is not actually necessary to make the sentence complete. Writers are advised to avoid dangling modifiers caused by the use of the passive voice. A dangling modifier is a word or phrase that modifies a word not clearly stated in the sentence. Instead of writing "To cut down on costs, the survey consisted of 1000 respondents", one should write "To cut down on costs, the team sampled 1000 respondents". In the first case, the construction of the sentence indicates that the survey was the one that decided to cut down costs.
The active voice is widely preferred over the passive voice in non-scientific writing, as the active voice yields clearer and more direct statements. The passive voice ought to be used, however, when the object is much more important than the subject, especially when the identity of the subject does not matter. In scientific writing, it is widely regarded as acceptable to use the passive voice.
Here it is more rhetorically effective to use an indirect expression. The passive voice may also help avoid calling attention to oneself, i.e., rather than write "I selected a sample of 1000 respondents", you may choose to write "A sample of 1000 respondents was selected". Another way of writing the sentence is to use a third-person style, i.e., to use "the researcher" in place of "I": The researcher selected a sample of 1000 respondents. III. Punctuation: refers to breaking words into groups of thoughts, as signals to readers that another set of thoughts is to be introduced. Punctuation marks make it easier for the reader to follow the flow of a writer's thoughts, and they emphasize and clarify what the writer means. The rules for the use of a number of punctuation marks are fairly standard, although they are not static, i.e., they have changed through the years. These conventions and rules are created and maintained by writers to help make their text more effective in communicating ideas.
The period (.) completes a sentence, while the semicolon (;) joins two complete sentences where the second is closely related to the first. An exclamation point (!) also ends a sentence but, unlike the period, indicates surprise or emphasis. A question mark (?) also ends a sentence, but it poses a question to be answered, either by the reader or by the researcher/writer. (Note that rhetorical questions need no answer.)
The colon (:) is placed at the end of a complete sentence to introduce a word, a phrase, a sentence, a quotation, or a list. The National Statistics Office has only one major objective: to generate statistics that the public trusts. The most common way of using a colon, however, is to introduce a list of items.
The regression model involved explaining per capita income of the family with a number of characteristics of the family: family size, number of working members in the household, years of education of the household head, and amenities. Note that the colon should not be placed after the verb in a sentence, even when you are introducing something, because the verb itself introduces the list and the colon would thus be redundant. If you are not sure whether you need a colon in a particular sentence, try reading the sentence and, when you reach the colon, substituting the word "namely". If the sentence reads well, then you may need the colon. (Of course, there are no guarantees!)
The comma (,) can be used more variedly: it can separate the elements of a list, or it can join an introductory clause to the main part of a sentence. Some writers can tell where a comma is needed merely by reading their text aloud and inserting a comma wherever a clear pause is needed in the sentence. When a reader encounters a comma, the comma tells the reader to pause (as in this sentence). In the preceding sentence, the comma joins the introductory clause ("When a reader encounters a comma") to the major part of the sentence ("the comma tells the reader to pause"). Another reason for the pause could be that the words form a list, and the reader must understand that the items in the list are separate. What is often unclear is whether to include the comma between the last and second-to-last items in a list. In the past, it was not considered proper to omit the final comma in a series; modern writers believe that conjunctions such as "and", "but" and "or" do the same thing as a comma, and they argue that a sentence is more economical without it. Thus, you actually have the option to choose whether or not to include the final comma. Many writers, however, still follow the old rule and expect to see the final comma. Note also that while we can use a semicolon to connect two sentences, more often we glue two sentences together with a comma and a conjunction.
A regression model was run on the data, and then model diagnostic tools were implemented. If your sentence is rather short (perhaps between five and ten words), you may opt to omit the comma. The comma may also be used to attach one or more words to the front or back of the core sentence, or when you insert a group of words into the middle of a sentence. If the group of words can be viewed as non-essential, commas have to be put on both sides, as in the following example: The poverty data, sourced from the 2003 Family Income and Expenditure Survey, suggest a fall in the percentage of poor people (compared to the previous survey conducted three years ago). For more grammar bits and tips, you may want to look over various sources from the internet, such as http://owl.english.purdue.edu/handouts/grammar/
List of References
Albert, Jose Ramon G. (2008). Statistical Analysis with STATA Training Manual. Makati City: Philippine Institute for Development Studies.
Alley, Michael (2003). The Craft of Scientific Presentations. New York: Springer-Verlag.
Asis, Maruja M.B. (2002). Formulating the Research Problem. Reference Material, University of the Philippines Summer Workshops on Research Proposal Writing.
Babbie, Earl (1992). The Practice of Social Research (6th ed.). California: Wadsworth Publishing Co.
Booth, Wayne C., Gregory G. Colomb, and Joseph M. Williams (1995). The Craft of Research. Chicago: University of Chicago Press.
Cleveland, W. S. and R. McGill (1984). Graphical Perception: Theory, Experimentation and Application to the Development of Graphical Methods. Journal of the American Statistical Association, 79, 531-554.
Draper, Norman and Harry Smith (1998). Applied Regression Analysis. New York: John Wiley.
Gall, Meredith D., Walter R. Borg, and Joyce P. Gall (2003). Educational Research: An Introduction (7th ed.). New York: Pearson.
Purdue University Online Writing Lab, grammar handouts. http://owl.english.purdue.edu/handouts/grammar/
Haughton, Jonathan and Shahidur R. Khandker (2009). Handbook on Poverty and Inequality. Washington, DC: World Bank.
Hosmer, David W., Jr. and Stanley Lemeshow (2000). Applied Logistic Regression (2nd ed.). New York: John Wiley & Sons.
Huff, Darrell (1954). How to Lie with Statistics. New York: W.W. Norton.
Kuhn, Thomas S. (1970). The Structure of Scientific Revolutions (2nd ed.). Chicago: University of Chicago Press.
Meliton, Juanico B. (2002). Reviewing Related Literature. Reference Material, University of the Philippines Summer Workshops on Research Proposal Writing.
Mercado, Cesar M.B. (2002). Overview of the Research Process. Reference Material, University of the Philippines Summer Workshops on Research Proposal Writing.
Mercado, Cesar M.B. (2002). Proposal and Report Writing. Reference Material, University of the Philippines Summer Workshops on Research Proposal Writing.
Nachmias, Chava F. and David Nachmias (1996). Research Methods in the Social Sciences (5th ed.). New York: St. Martin's Press.
Phillips, Estelle M. and Derek S. Pugh (2000). How to Get a Ph.D.: A Handbook for Students and Their Supervisors (3rd ed.). Philadelphia: Open University Press.
Stevens, S. (Ed.) (1951). Handbook of Experimental Psychology. New York: Wiley.
Statistical Research and Training Center (SRTC). Training Manual on Research Methods.
University of the Philippines Statistical Center Research Foundation. Training Manual on Statistical Methods for the Social Sciences.
World Bank (2000). Integrating Quantitative and Qualitative Research in Development Projects. Michael Bamberger (ed.). Washington, DC: World Bank.
Research objectives need to be SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) to ensure the research process is structured, focused, and feasible. Clearly defined objectives help in outlining a manageable scope, guiding data collection, and ensuring that the research aligns with the key questions and hypotheses. Poorly formulated objectives can lead to unnecessary and excessive data collection, making the research project difficult to manage and potentially resulting in arbitrary and invalid conclusions. This lack of focus can also lead to gathering irrelevant data, which increases the workload and complicates analysis. Consequently, researchers may struggle to achieve clear and actionable findings, leading to wasted resources and time.
Researchers can communicate effectively by understanding the variability in audience interest and expectations, crafting their reports for clarity, and using a path-like narrative to guide readers through the research journey. They should also consider different dissemination methods, such as reputable journals, and consider how the research usefully supports or contradicts readers' beliefs. Additionally, using visuals and simplifying complex data during presentations enhances understanding.
The selection of literature review sources affects a research project by establishing the originality of the research, avoiding unintentional duplication, and providing theoretical and conceptual frameworks. It can guide topic selection, methodology ideas, and the interpretation of results, thus influencing how the research refines and extends current knowledge. The selection involves subjective judgments of importance and validity, which contextualize the project and aid in problem identification and hypothesis formulation. Utilizing high-quality, peer-reviewed sources is crucial, as it provides a reliable foundation for the research. Furthermore, a well-organized and systematic literature review ensures coherence and relevance to the study's objectives, hypotheses, and methodological approaches.
Public presentations offer several benefits for disseminating research findings, including the opportunity for direct interaction with the audience, which can lead to immediate feedback and discussions that enhance understanding and potentially refine the research results. Presentations also enable researchers to reach a wide audience, potentially raising the visibility and impact of their work. However, there are risks involved, such as the potential for misinterpretation of data if the findings are not communicated clearly and accurately. Additionally, the pressure of public speaking can lead to oversimplification of complex data to ensure clarity for a broader audience.
The least squares method in regression analysis is used to estimate the parameters of a regression line by minimizing the sum of the squares of the vertical distances of the data points from the fitted line. This method provides estimates of the intercept and slope that describe the relationship between the dependent and independent variables, typically in a linear fashion. In practice, the regression line derived from least squares is considered the best-fitting line for the dataset, providing predictions for the dependent variable given the independent variables. The implications for interpreting data using the least squares method include determining model adequacy by analyzing residuals, which should display no discernible patterns if the model is appropriate. It also involves checking for violations of assumptions such as linearity, homoscedasticity, and non-autocorrelated residuals; where these fail, alternative methods may be necessary. The method is straightforward and applicable when the assumptions are met, allowing researchers to make inferences about relationships and predictions, but caution is warranted in assuming causal relationships without additional context or tests.
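As a purely illustrative sketch of how such a least squares fit and its residuals might be obtained in Stata, the commands below use Stata's bundled auto example dataset; the variables price, mpg and weight belong to that example dataset and have no connection to any study discussed in this manual.

* A minimal least-squares fit in Stata, using the bundled auto example data
sysuse auto, clear
regress price mpg weight        // ordinary least squares estimates of intercept and slopes
predict yhat, xb                // fitted values on the estimated regression line
predict resid, residuals        // residuals: observed minus fitted values
summarize resid                 // with a constant in the model these average (essentially) zero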
Valued practices for delivering impactful research presentations include organizing content logically, considering the audience, and practicing delivery extensively. It is crucial to have a well-structured presentation with a clear and logical sequence of points, as a disorganized presentation risks miscommunicating the research results. Presenters should tailor the talk to connect with the audience's interests and knowledge level, which involves anticipating audience questions and preparing accordingly. Practicing the presentation is also vital for refining delivery, as rehearsing helps in adjusting the pacing, ensuring clarity, and identifying distracting mannerisms or disruptions in flow. Maintaining eye contact and using clear, confident speech helps engage the audience effectively. These practices ensure that the research findings are communicated clearly and remembered by the audience, thus fulfilling the primary purpose of educating or informing the audience.
Residual plots are valuable tools for diagnosing the adequacy of a regression model by revealing deviations that may not be evident from the numerical output alone. These plots, such as residuals versus predicted values, residuals versus explanatory variables, and residuals versus time, help assess whether the model's assumptions hold. For example, they can be used to detect nonlinearity, which would manifest as patterns rather than random dispersion in the plot, indicating that the relationship between the variables is not well captured by the model. Furthermore, residual plots can identify heteroscedasticity, where the spread of the residuals varies across levels of an explanatory variable, suggesting a violation of the constant variance assumption. They also assist in checking for autocorrelation, particularly in time series data, where residuals exhibit correlations over time. A well-specified regression model will have residuals that are evenly scattered without discernible patterns, meaning that the model appropriately captures the relationships in the data without systematic errors. Residual analysis provides a straightforward mechanism for ensuring model robustness and the validity of inferences drawn from the model.
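Assuming a regression has just been run (for instance the illustrative auto-data fit sketched earlier), Stata offers the usual residual diagnostics as post-estimation commands; the lines below are a minimal sketch of typical usage, not a prescribed recipe.

* Residual diagnostics available after -regress- as post-estimation commands
rvfplot                         // residuals versus fitted values: look for systematic patterns
rvpplot mpg                     // residuals versus a single explanatory variable
estat hettest                   // Breusch-Pagan test for heteroscedasticity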
Challenges of inference from correlation analysis include mistakenly attributing causation to observed correlations, which may be due to confounding factors or bidirectional influences. To address these, researchers should rely on background knowledge, design studies that account for third variables, and apply statistical methods that mitigate biases (e.g., robust regression). Without a theoretical basis, inferring causation is unjustifiable.
To ensure a research problem aligns with personal and external expectations, a researcher should verify that the problem is in line with their own goals and with the expectations of others. They must assess their interest in the problem, free from biases, and ensure they have, or can acquire, the necessary skills and resources. Additionally, they must consider the significance and scope required by their institution or by publication standards.
A researcher may choose to suppress the constant in a regression model to force the regression line to pass through the origin, effectively setting the intercept to zero. This approach is taken when the theoretical context or the nature of the data justifies that the dependent variable Y should be zero when all predictor variables X are also zero. Suppressing the intercept can simplify the model and its interpretation in such cases. However, it affects interpretation by possibly leading to biased estimates, as the model may not accurately reflect the data structure if the true relationship does not naturally run through the origin.
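In Stata, the constant is suppressed with the noconstant option of regress. The sketch below again uses the bundled auto data purely for illustration, so the choice of variables makes no substantive claim about when regression through the origin is appropriate.

* Regression through the origin: the noconstant option suppresses the intercept
sysuse auto, clear
regress price weight, noconstant    // line forced through the origin
regress price weight                // default model with an intercept, for comparison

Note that when the constant is suppressed, the reported R-squared is computed about zero rather than about the mean of the dependent variable, so it is not comparable with the R-squared of the default model.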