Introduction
Econometrics involves the application of statistical and mathematical methods to analyze economic theories using data, employing various estimation techniques and hypothesis testing. It distinguishes itself from economic statistics by unifying economic theory, mathematics, and statistics. Typically, economic theories are represented in mathematical forms, while econometric methods utilize statistical techniques to explain economic phenomena in a stochastic context. By estimating parameter values, which are the coefficients in mathematical equations that depict economic relationships, econometrics captures the random behavior of these relationships that traditional economic theories may overlook.
Econometrics is distinct from statistics, as it focuses on analyzing non-experimental data rather than relying solely on controlled experiments. It employs statistical methods to validate economic theories by incorporating randomness into economic relationships. Ultimately, econometrics aims to identify and quantify the stochastic elements in economic models using real-world data.
Econometrics has developed as a distinct field because traditional statistical methods often fall short in addressing complex economic questions. Since economic issues cannot typically be examined in controlled experimental settings, real-world data is essential for identifying economic patterns. Economic theories guide the formulation of econometric models, which are then used to analyze data and draw inferences. There are two main branches of econometrics: theoretical econometrics, which focuses on creating new methods for measuring economic relationships, and applied econometrics, which utilizes these methods to analyze economic phenomena and forecast behavior.
Economic theory informs the statistical methods used to create econometric models for analyzing economic issues. Human capital theory posits that workers with similar attributes, such as education and experience, should receive comparable wages. This theory can be quantified through a wage equation, where wages serve as the dependent variable and worker characteristics are represented as independent variables. By integrating statistical patterns, this wage regression equation evolves into an econometric model that can be estimated using real-world data, allowing for the testing of human capital theory's validity. Additionally, the estimation of this econometric model yields valuable economic insights.
1 Econometrics means measurement in economics. It has developed systematically since the establishment of the Econometric Society in 1930 and the publication of the journal Econometrica in January 1933.
Gender wage discrimination is a significant area of research in labor economics, often analyzed through the lens of human capital theory. By comparing the estimated wages of men and women with similar characteristics, we can assess the presence of discrimination. If the data support the hypothesis of gender discrimination, this raises serious concerns regarding women's participation in the labor market. Additionally, examining wage differences between groups over time can reveal trends; a narrowing gender wage gap may suggest improvements in labor market outcomes. However, statistics often cast doubt on the overall participation of women in the labor market.
Return to education, a key issue in human capital theory, refers to the wage increase associated with each additional year of schooling and is crucial for making informed investment decisions in education. Estimating this return through statistical methods poses challenges, particularly when using real-world data, as workers possess varying abilities that influence their earnings. High-ability individuals tend to earn more than their lower-ability counterparts, even at the same education level. Therefore, failing to account for these ability differences can lead to misleading conclusions about the true impact of education on wages.
In econometrics, we integrate economic theory with statistical methods to explore and analyze significant economic questions. For instance, economic theory offers various models for stock prices, which can be utilized to assess the efficiency of stock markets. A stock market is deemed efficient when stock prices fully reflect all available information.
Inefficiencies in the stock market can lead to arbitrage opportunities, allowing savvy investors to accumulate wealth. Econometrics plays a crucial role in analyzing these arbitrage effects by applying economic theory alongside statistical tools. Beyond economics, econometric methods are also valuable in fields such as engineering, biological sciences, medical sciences, geosciences, and agricultural sciences, as they describe the stochastic relationships among variables in mathematical terms.
This introductory chapter is structured to cover key concepts in econometrics. Section 1.2 differentiates between economic and econometric models, while Section 1.3 explains the population and sample regression functions. The distinction between parametric and nonparametric models is addressed in Section 1.4. In Section 1.5, the chapter outlines the steps involved in creating an econometric model, emphasizing the importance of data as a primary input. Section 1.6 discusses the role of data in econometric analysis, highlighting the necessity of appropriate software for applying econometric theories to real-life data. Section 1.7 introduces basic steps for using Stata 15.1, and Section 1.8 presents essential matrix algebra operations commonly utilized in econometrics.
Economic Model and Econometric Model
An economic model is a mathematical representation that simplifies real-world economic processes through a set of assumptions. It describes behaviors such as utility maximization under budget constraints, leading to the formulation of demand functions. These demand functions illustrate the relationship between the quantity demanded of a commodity and factors like its price, the prices of related goods, consumer income, and preferences. The demand equation derived from this economic model serves as a foundation for econometric analysis of consumer demand.
To analyze how income affects the demand for a commodity, we refer to economic theory, which posits that the quantity demanded is influenced by various factors, including the commodity's price, the prices of related goods, consumer income, and individual tastes and preferences. This relationship is mathematically represented through a demand function, expressed as y = f(x1, x2, x3, …) (1.2.1), where each argument captures a different determinant of demand.
Here, y denotes the quantity demanded of a commodity, x1 is income, x2 is its own price, and x3 is the price of other related commodities.
Under the ceteris paribus assumption, there is a distinct relationship between quantity demanded (y) and household income (x1). Assuming this relationship is linear, it can be represented by the equation y = β0 + β1x1 (1.2.2).
This relationship can be expressed geometrically, as shown in Fig. 1.1, and is known as the Engel curve in standard intermediate microeconomics textbooks.
An economic model described by the relationship given in (1.2.2) is deterministic, or purely mathematical. A diagrammatic representation is shown in Fig. 1.1.
In reality, we cannot control all external factors that influence quantity demanded. To validate the theoretical claim presented in Equation (1.2.2), it is essential to account for these unconsidered factors by introducing a new variable, denoted as \( u \), into the model. This leads to the revised equation: \( y = \beta_0 + \beta_1 x_1 + u \) (1.2.3).
In this model, a single factor, x1, is used to explain the variable y, while all other influencing factors are collectively represented by u. The disturbance term, u, disrupts the linear relationship between x1 and y, and it is also referred to as the error term.
The disturbance term or error term, u, is assumed to be random, the behaviour of which is described by a probability function. As u is unobservable, we cannot utilise
The introduction of a random disturbance term into the economic model transforms the income demand relationship into a stochastic one. In reality, most relationships between economic variables are inherently stochastic, leading to the formation of an econometric model.
In an econometric model, the dependent variable is called an explained variable and the independent variables are called explanatory variables.
The random error term in a regression equation represents the portion of the dependent variable that remains unpredictable by the independent variables. It accounts for the influence of numerous omitted factors. For instance, in analyzing household demand for a commodity, income is not the sole determinant; factors such as family size, preferences, and spending habits also significantly impact demand.
The error term in econometric models accounts for unobserved variables, some of which may be unquantifiable or unidentified. To mitigate the impact of these unobserved disturbances, we can increase the number of explanatory variables. In Eq. (1.2.3), the disturbance u absorbs the price of the commodity (x2), the prices of other commodities (x3), and the tastes and preferences of buyers as determined by the utility function. Since x2 and x3 are observable, we can isolate them from the random disturbance u, which gives y = β0 + β1x1 + β2x2 + β3x3 + ε.
The coefficients β0, β1, β2, and β3 define the relationship between quantity demanded, consumer income, the price of the specific commodity, and the prices of related goods. Additionally, ε represents the unobserved preferences of buyers in this equation.
An economic model establishes a theoretical relationship, while an econometric model analyzes real-life situations, highlighting the variability in buyer behavior at the same income level due to individual preferences. This distinction underscores that an econometric model is an empirical formulation derived from an economic model, incorporating both deterministic and stochastic components. The stochastic element, represented by a disturbance term, follows a probability distribution and is crucial for understanding the nature of the relationship between variables. By specifying a particular econometric model, ambiguities in the economic model are clarified, with variable selection guided by economic theory and data considerations. The disturbance term captures the influence of unaccounted factors, enhancing the model's accuracy in reflecting real-world dynamics.
An econometric model is developed from an economic model to illustrate potential relationships between variables. While the economic model offers a logical framework for understanding an issue, it does not confirm its validity. To validate the logical relationships and underlying assumptions, an econometric model must be specified based on the economic model's formulation, and hypotheses must be tested using sample data.
Population Regression Function and Sample Regression Function
In econometrics, a population refers to the entire set of elements relevant for analysis, akin to the universal set in set theory. Economic theories propose propositions that are assumed to apply universally, thus focusing on the population. Consequently, econometric models derived from these economic theories aim to relate to the population. A primary goal of econometrics is to draw inferences about populations.
The econometric model presented in Eq. (1.2.3) indicates that a specific value of y cannot be uniquely determined for a given value of x1 due to the influence of the error term u. This results in a stochastic relationship between y and x1, which can be characterized by a probability distribution of u.
If the error term \( u \) follows a normal distribution, then \( y \) in Equation (1.2.3) will also exhibit a normal distribution. When the mean and variance of \( u \) are 0 and \( \sigma^2 \), respectively, the conditional mean and variance of \( y \) can be determined accordingly.
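The conditional moments mentioned above can be written out as a short derivation sketch. It assumes, as an explicit simplification, that x1 is treated as given and that u has mean 0 and variance σ² for every value of x1; the first line is the conditional mean function that the chapter refers to as Eq. (1.3.1).

```latex
\begin{align*}
E(y \mid x_1) &= \beta_0 + \beta_1 x_1 + E(u \mid x_1) = \beta_0 + \beta_1 x_1, \\
\operatorname{Var}(y \mid x_1) &= \operatorname{Var}(u \mid x_1) = \sigma^2, \\
y \mid x_1 &\sim N\!\left(\beta_0 + \beta_1 x_1,\ \sigma^2\right)
  \quad \text{when } u \text{ is normally distributed.}
\end{align*}
```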
The population regression function (PRF) is the conditional mean function: it relates the expected value of the population regressand (y) to the population regressors (x). It indicates how the average value of y responds to changes in x1; the actual value of y for an individual unit also depends on the disturbance u, so it need not move with x1 in the same way. This fundamental concept of regression analysis is explored in greater detail in Chapter 2.
For each value of x₁, various corresponding values of y can be derived from the normal density curve, as illustrated in Fig. 1.2. Connecting the mean values of y for different x₁ values forms a straight line, referred to as the population regression line (see Fig. 1.2).
The population represents the complete set of potential outcomes from a random experiment, serving as the theoretical foundation of an econometric model. However, the population itself is unobserved, and researchers rely on a finite subset known as a sample, which is drawn from the population. This sample is essential for validating the theoretical model. The primary goal of econometric analysis is to infer characteristics of the unobserved population based on the observed sample, a process referred to as statistical inference.
The sample counterpart of the econometric model is written for a specific dataset, which allows us to estimate relationships and draw inferences about the larger population: yi = β0 + β1x1i + ui (1.3.3). Here x1i and yi are the observed values of the variables x1 and y for observation unit i in the sample, so the equation describes the connection between y and x1 for the respective cross-section unit.
Thus, Eq. (1.3.3) is the sample counterpart of Eq. (1.2.3). It expresses the relationship for observation i, and the ui are realisations (sampled values) of the error variable.
If \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are the estimated values of β0 and β1, respectively, obtained by using the sample observations, then the estimated conditional mean value of y will be \( \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} \) (1.3.4).
Equation (1.3.4) is the sample counterpart of (1.3.1) and is called the sample regression function (SRF). The SRF is the estimated relation between the estimated yi and x1i.
Econometric models primarily focus on estimating unknown population parameters based on known sample statistics. In a linear regression model, the objective is to estimate the sample regression function (SRF) to explore the relationship between the dependent variable (y) and the independent variable (x), as proposed by the researcher's hypothesis.
Parametric and Nonparametric or Semiparametric Model
The parametric econometric model relies on prior knowledge of the functional relationship between variables, which, if accurate, allows for effective data explanation. However, incorrect functional form selection based on a priori information can lead to biased results (Fan and Yao, 2003). This model incorporates all available data information through its parameters; for instance, in a linear regression with a single regressor, two parameters (the coefficient and the intercept) are derived from the data analysis.
A parametric model has a fixed number of parameters, each with a fixed meaning. The simplest example is the Gaussian model parametrised by its mean and variance.
Nonparametric regression models offer greater flexibility by relaxing the assumption of linearity in regression analysis, allowing for a more nuanced exploration of data. These models are defined endogenously, meaning the data structure determines the regression model's form, utilizing extensive data information for estimation. With infinite-dimensional parameters, nonparametric models provide more degrees of freedom and adaptability, as exemplified by kernel density estimators that capture intricate distribution details through successive correction terms. Despite their advantages, nonparametric models lack a fundamental distinction from parametric models, as both ultimately approximate functional forms through numerous parameters. However, parametric models are often favored for their ease of estimation, interpretability, and superior statistical properties, which is why this book primarily focuses on parametric econometric models.
Steps in Formulating an Econometric Model
Estimation
Econometric models are developed using observed sample data and appropriate estimation methods, followed by hypothesis validity testing. Among the parametric models, three widely used estimation methods stand out:
• the method of moments,
• the method of least squares and
• the method of maximum likelihood.
The method of moments leverages moment conditions related to the zero unconditional and conditional means of random errors. Ordinary least squares (OLS) is the most widely used estimation method, focusing on minimizing the residual sum of squares (RSS) to select parameter estimators. Additionally, the method of maximum likelihood serves as a foundational approach for parametric classical estimation in econometrics, where maximum likelihood estimators are derived by maximizing the probability of observing the given responses based on the observed data.
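For the single-regressor model of Eq. (1.2.3), the least squares idea can be made concrete as follows. This is a sketch under the usual textbook setup (a sample of n observations with some variation in x1); the symbols b0 and b1 for trial parameter values, and the bars for sample means, are notation introduced here for illustration.

```latex
\begin{align*}
\mathrm{RSS}(b_0, b_1) &= \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_{1i} \right)^2, \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(y_i - \bar{y})}
                      {\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}_1 .
\end{align*}
```

The same two estimators are obtained from the sample analogues of the moment conditions E(u) = 0 and E(x1 u) = 0, which is why the method of moments and OLS coincide in this simple case.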
In the nonparametric model, the methods of estimation include
Nonparametric estimation requires fewer assumptions but limits the conclusions that can be drawn from data. Generally, the best parametric estimator outperforms the best semiparametric estimator, with the generalized method of moments being central to semiparametric estimation. Bayesian estimation has become significant for offering elegant and manageable solutions to various problems, while simulation-based estimation and bootstrapping address computationally challenging issues.
In a random sample of size n from a population distribution with parameter β, an estimator, denoted as \( \hat{\beta} = T(y_1, \ldots, y_n) \), is a function of the sample, and its value is the estimate of β. Because it yields a single value, it is called a point estimator. An effective estimator should be closely centered around β with minimal variance, and as n approaches infinity it should converge to β with high probability.
An interval \( (\hat{\beta}_1, \hat{\beta}_2) \) such that \( P(\hat{\beta}_1 \le \beta \le \hat{\beta}_2) = 1 - \alpha \), \( \alpha \in (0, 1) \), is called a 100(1 − α)% confidence interval of β. The random variables \( \hat{\beta}_1 \) and \( \hat{\beta}_2 \) are called the lower and upper limits, respectively;
1 − α is called the confidence coefficient.
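One common way such limits arise is sketched below. It assumes a linear regression with k estimated coefficients, normally distributed errors, and a standard error se(β̂) computed from the sample; none of these details are spelled out in the passage above, so they should be read as illustrative assumptions. The critical value c is the upper α/2 point of the t distribution with n − k degrees of freedom.

```latex
\begin{align*}
\frac{\hat{\beta} - \beta}{\operatorname{se}(\hat{\beta})} &\sim t_{n-k}, \qquad
c = t_{\alpha/2,\; n-k}, \\
P\!\left( \hat{\beta} - c \operatorname{se}(\hat{\beta}) \le \beta \le
          \hat{\beta} + c \operatorname{se}(\hat{\beta}) \right) &= 1 - \alpha, \\
\hat{\beta}_1 = \hat{\beta} - c \operatorname{se}(\hat{\beta}), &\qquad
\hat{\beta}_2 = \hat{\beta} + c \operatorname{se}(\hat{\beta}).
\end{align*}
```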
Testing of Hypothesis
After estimating the model, it is essential to conduct goodness-of-fit testing to evaluate its accuracy. Hypothesis testing plays a crucial role in statistical inference, allowing us to draw conclusions about a population based on a sample. For instance, we use a sample mean to estimate the population mean and verify the validity of our claim regarding the population mean by analyzing the sampling distribution of the sample mean. This statistical process assesses the likelihood of claims about a population using data derived from a representative sample.
The sample mean serves as an unbiased estimator of the population mean, indicating that, on average over repeated samples, it will equal the population mean. For instance, if the population mean of household income is Rs 15,000, then the average of the sample means across repeated random samples will also be Rs 15,000.
Hypothesis testing in econometrics involves several key steps. First, an econometric model is specified, followed by the formulation of hypotheses grounded in theoretical frameworks. For instance, one might hypothesize that the price of other commodities (x3) does not influence the demand for a specific commodity, which can be mathematically represented as the population parameter β3 = 0.
To evaluate this hypothesis, we will estimate β3 from a random sample of the population and compare it to the expected value assuming the claim is valid. We anticipate that the estimated β3 will be close to 0. If the difference between the estimated statistic and the hypothesized parameter value is small relative to its sampling variability, we do not reject the claim; however, a significant discrepancy will lead us to reject it.
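The comparison described above is usually carried out with a t statistic. The sketch below assumes the four-coefficient demand equation (β0 to β3), normally distributed errors, and a two-sided test at significance level α; the standard error se(β̂3) is taken as given from the estimation output.

```latex
\begin{align*}
H_0 &: \beta_3 = 0 \qquad \text{against} \qquad H_1 : \beta_3 \neq 0, \\
t &= \frac{\hat{\beta}_3 - 0}{\operatorname{se}(\hat{\beta}_3)} \sim t_{n-4}
     \quad \text{under } H_0, \\
\text{reject } H_0 &\text{ if } |t| > t_{\alpha/2,\; n-4}.
\end{align*}
```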
Forecasting
Forecasting plays a crucial role in economic decision-making, enabling policymakers to assess the effectiveness of economic policies. It involves predicting future data values through estimated models, with regression analysis commonly employed for this purpose. By relying on regression models, forecasts are based on the assumption that the relationships identified in the model will persist into the future.
In time series econometrics, there are two main types of forecasts: ex-post and ex-ante. Ex-post forecasts utilize actual information available beyond the estimation period, allowing for the evaluation of forecasting model performance. Conversely, ex-ante forecasts are created for future periods where actual data is unavailable, relying on predicted values of the influencing factors to generate these forecasts.
Data
Cross Section Data
Cross-section data is obtained through sample surveys or complete enumeration, capturing information from units such as households, firms, or countries at a specific point in time. Typically, data is collected over a brief duration, often within a year, resulting in what is termed a cross-section data set. The data generation process during this short timeframe is deterministic, indicating that the observed values of variables, like income, are influenced by known factors understood by respondents, making cross-section data non-stochastic in nature.
Cross-section data is collected by individual researchers or official agencies, such as the National Sample Survey Office (NSSO) in India, which operates under the Ministry of Statistics and Programme Implementation (MOSPI). The NSSO conducts various surveys, notably the household consumer expenditure survey and the employment and unemployment survey, which are key sources of cross-section data in Indian official statistics. This type of data is extensively utilized in economics and other social sciences, providing valuable insights for research and analysis.
Non-experimental data are gathered without the use of controlled experiments, while experimental data are obtained in laboratory settings. Cross-section data play a crucial role in various fields, including labor economics, industrial organization, demography, health economics, and other areas of applied microeconomics.
Cross-sectional data are typically gathered through random sampling from a larger population, which facilitates easier analysis. However, there are instances where the sampling may not be entirely random. For instance, when examining factors that influence the purchase of a new car, some households may lack the income or wealth necessary to buy one and might choose not to participate in the survey. Although the data collection method is random sampling, the resulting sample may not represent the population accurately, leading to a sample selection problem.
Time Series Data
Time series data involves observations on one or more variables collected over time, with time being a crucial dimension. Most macroeconomic data is represented as time series, which are characterized by a stochastic data-generating process and a joint probability density function. Researchers must analyze the stochastic behavior of these variables before incorporating them into econometric models. Unlike cross-sectional data, time series data is typically estimated and found in official statistics rather than collected through surveys. In India, the National Accounts Division (NAD) of the Central Statistics Office (CSO) provides the primary source of macroeconomic time series through National Accounts Statistics (NAS). Time series data is essential for analyzing trends and forecasting in macroeconometric models, as well as for predicting volatility and mean returns in finance.
Time series data are characterized by their strong dependence on recent historical values, presenting a significant challenge when applying standard econometric models. Consequently, additional steps are required to properly specify econometric models for time series data prior to utilizing conventional econometric techniques.
Pooled Cross Section
Pooled cross section data is created by combining two or more sets of cross section data that address similar issues, collected from different samples at various time points within the same population. This type of data retains characteristics similar to those of traditional cross section data. For instance, consider two cross section datasets from the National Sample Survey Office (NSSO) in India, one from an employment survey in 2004 and another from 2011, both utilizing the same sample design but drawn from different samples.
The households are chosen randomly from the same population both in 2004 and in 2011. If we combine these two different random samples in two different time periods from the same population, we get a pooled cross section. The use of pooled cross section data provides more robust results because it contains a larger number of observations for different time periods. Pooled cross section data are useful for looking into changing behaviour over two or more time points.
Panel Data
Panel data combines cross-sectional and time series data, obtained by conducting repeated surveys with the same sample units over time to gather information on similar issues. Each cross-sectional unit's time series creates a set of panel or longitudinal data. When the cross-sectional units are micro units, such as households or firms, the data is referred to as a micro panel, where the time dimension is smaller than the cross-sectional dimension. Conversely, if the units are macro entities like countries, the data is termed a macro panel, characterized by a significantly larger time dimension compared to the cross-sectional dimension. Additionally, panel data can be classified as balanced or unbalanced, depending on whether complete information is available for all units at every time point.
Panel data uniquely analyzes the same cross-sectional units over time, but acquiring such data, particularly on individuals, households, and firms, is challenging in developing countries. In India, for instance, official statistics lack micro panel data, leading to a growing reliance on pooled cross-section data for econometric modeling in the developing world.
Use of Econometric Software: Stata 15.1
Data Management
This section addresses key aspects of working with Stata data files, including methods for opening data files, importing data into Stata format, and extracting raw data efficiently.
Stata data sets are rectangular arrays with n observations on m variables.
If we have a data file (.dta file) saved on the hard disk, we could use the menus to open it: File > Open > select the file.
Alternatively, we can type the use command followed by the full path of the file.
Stata can import Excel (.xls) files easily. If we have an Excel file saved as CSV (Comma Delimited), we can import it by using the command:
insheet using (file.csv)
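As a hedged illustration of importing, the sketch below first writes copies of Stata's built-in auto dataset to disk so that it has something to read back; the file names auto_copy.csv and auto_copy.xlsx are hypothetical. It uses import delimited, which superseded insheet from Stata 13 onwards, and import excel for spreadsheet files.

```stata
* Create example files from the built-in auto dataset (file names are hypothetical)
sysuse auto, clear
export delimited using "auto_copy.csv", replace
export excel using "auto_copy.xlsx", firstrow(variables) replace

* Read the comma-delimited file; the first row is used for variable names
import delimited using "auto_copy.csv", clear
describe

* Read the Excel file; firstrow treats the first row as variable names
import excel using "auto_copy.xlsx", firstrow clear
describe
```

In both cases the clear option discards whatever data are currently in memory before the new file is loaded.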
Extract raw survey data from Notepad
We can extract the raw data by constructing a dictionary file with the command infix or infile.
The infile command specifies the variable names, while the keyword 'using' indicates the file name that contains the data in free format, with variables separated by spaces, commas, or tabs.
To analyze survey data in a fixed format, utilize the infix command to indicate the specific positions of each variable, followed by the corresponding txt file name.
If we have a large number of variables, we can create a dictionary file by using infile command, but the syntax for a dictionary file is complicated.
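A minimal sketch of both commands without a dictionary file follows. The two small text files, their names, and the column layout (id in columns 1-3, age in 4-5, income in 6-10) are invented here purely for illustration; real survey files come with their own layout documentation.

```stata
* Write a tiny fixed-format file: id in cols 1-3, age in 4-5, income in 6-10
tempname fh
file open `fh' using "raw_fixed.txt", write text replace
file write `fh' "0012515000" _n
file write `fh' "0024222500" _n
file close `fh'

* infix reads each variable from the stated column positions
infix id 1-3 age 4-5 income 6-10 using "raw_fixed.txt", clear
list

* Write the same records in free format, with values separated by spaces
file open `fh' using "raw_free.txt", write text replace
file write `fh' "1 25 15000" _n
file write `fh' "2 42 22500" _n
file close `fh'

* infile reads the values in order and assigns them to the listed names
infile id age income using "raw_free.txt", clear
list
```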
After creating a Stata system file, we can save it by using the save command.
To open a Stata data file, the use command is to be executed.
To delete some variables from the data file, we can use drop command followed by the variable names.
Alternatively, if we want to retain some variables we have to use keep command
For adding more observations, we need to use append command.
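The file-handling commands described above can be tried together on Stata's built-in auto dataset. This is only a sketch: the file name mydata.dta and the choice of variables are hypothetical.

```stata
sysuse auto, clear                 // built-in example dataset (74 observations)
save "mydata.dta", replace         // save the data in memory as a Stata file
use "mydata.dta", clear            // open the saved file again

drop headroom trunk                // delete the listed variables
keep make price mpg foreign        // retain only the listed variables

append using "mydata.dta"          // add the observations of the saved file
count                              // 74 + 74 observations are now in memory
```

Variables that exist only in the appended file are added to the data in memory and are missing for the original observations.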
For adding more variables, after sorting the data by cross section id (csid), we can use the following command.
merge csid using (file name)
In merging, a new variable, _merge, is created, and we have to execute the following commands before moving further.
tab _merge
drop _merge
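From Stata 11 onwards the merge command also accepts an explicit match type such as 1:1, and in that form the data need not be sorted beforehand. The sketch below uses two invented three-row datasets keyed on csid; the values and the file name income_part.dta are hypothetical.

```stata
* First dataset: csid and income, saved to disk
clear
input csid income
1 15000
2 22500
3 18000
end
save "income_part.dta", replace

* Second dataset: csid and age, kept in memory as the master data
clear
input csid age
1 25
2 42
4 37
end

* One-to-one merge on csid; unmatched units from either file are kept
merge 1:1 csid using "income_part.dta"
tab _merge        // 1 = master only, 2 = using only, 3 = matched
drop _merge
```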
Variable names in Stata can be up to 32 characters long, but it's advisable to keep them shorter for clarity. Stata is case-sensitive, meaning that "Age" and "age" are treated as distinct variables. For multi-word variable names, such as "family income," underscores should be used to connect the words, resulting in "family_income."
In Stata, variables are categorized into two main types: string variables and numerical variables. Numerical variables can be further divided into continuous and categorical types, with categorical variables representing distinct groups; for instance, male is assigned the code 1, while female is represented by the code 2.
In Stata 15.1, string variables can reach lengths of up to two billion characters, but for fixed lengths, you should use str1 for one character or str20 for twenty characters, while strL is designated for long strings. When manipulating string variables, it's important to remember that Stata requires strings to be enclosed in double quotation marks.
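To convert a string variable that contains only numbers into a numeric one in Stata, use the command `destring`. For nonnumeric string variables, utilize the `encode` command, which assigns a numeric code to each distinct string and stores the strings as value labels. Conversely, to transform a labelled numeric variable back into a string, the `decode` command is required; a plain numeric variable can be converted with `tostring`.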
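The conversions can be compared on a small invented dataset; the variable names and values below are hypothetical. Note that encode assigns codes in alphabetical order of the strings, so the codes it produces need not match a coding scheme fixed in advance.

```stata
clear
input str5 income_str str6 sex_str
"15000" "male"
"22500" "female"
"18000" "female"
end

destring income_str, generate(income)   // digits stored as text -> numeric
encode sex_str, generate(sex)           // text categories -> labelled numeric codes
decode sex, generate(sex_back)          // labelled numeric -> string
tostring income, generate(income_txt)   // plain numeric -> string
list
```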
To simplify the representation of a national industrial classification, we can reduce the number of digits in the variable. For instance, if the original classification is a 5-digit number stored in the variable `nic5`, we can create a new variable `nic2` that keeps only the first 2 digits. This can be achieved by using the command `gen nic2=int(nic5/1000)` if `nic5` is numeric, or `gen nic2=substr(nic5, 1,2)` if `nic5` is a string.
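A quick check of the arithmetic: dividing a five-digit code by 1000 and truncating leaves its first two digits (for example, 10712 becomes 10). The sketch below uses two made-up codes and also shows the string route; the variable names nic5_str and nic2_str are hypothetical.

```stata
clear
input long nic5
10712
47190
end

gen nic2 = int(nic5/1000)               // numeric route: 10712 -> 10
gen str5 nic5_str = string(nic5)        // a string copy of the code
gen nic2_str = substr(nic5_str, 1, 2)   // string route: "47190" -> "47"
list
```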
Missing values are common in micro surveys, with numeric variables represented by a dot and string variables indicated by an empty string ("").
Stata can label the data by using the label data command. A data label can be up to 80 characters long.
label data "Consumer Expenditure Survey Data"
We can also add notes by using the notes command followed by a colon:
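A short sketch on the built-in auto dataset follows; the note texts are placeholders. A note written without a variable name is attached to the dataset as a whole, while a variable name before the colon attaches it to that variable.

```stata
sysuse auto, clear
notes: Example note attached to the dataset as a whole
notes price: Example note attached to the variable price
notes            // list all notes currently stored
```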
To label variables in a data set, you can use the command "label variable" followed by the variable name and a description of up to 80 characters in quotes. For instance, to label a variable named "temp" as "Temperature Degrees C," you would execute the following command.
label variable temp "Temperature Degrees C"
In our dataset, we can assign labels to the categorical variable "social_group," where ST is represented as 1, SC as 2, OBC as 3, and Others as 9, according to the NSS survey data.
label define social_group 1 "ST" 2 "SC" 3 "OBC" 9 "Others", replace
label values social_group social_group
label variable social_group "Caste of Household"
Generating Variables
The generate command allows users to create a new variable by applying a suitable operator. For instance, to generate a new variable representing the natural log of price, referred to as logged_price, the following command can be utilized.
gen logged_price = ln(price)
To generate a variable equal to twice the square root of price, we have to use the command
gen twice_root_price = 2*sqrt(price)
Variable names must not include spaces. The high correlation between the variables price and twice_root_price can lead to collinearity issues in linear regression. To mitigate this, we can center the variable by subtracting the mean before applying the square root. The quietly summarize command allows us to obtain the mean from the stored result r(mean) while suppressing the output.
quietly summarize price
gen ctwice_root_price = 2*sqrt(price-r(mean)) // missing wherever price is below its mean, since sqrt() of a negative number is missing
Here, the centred variable is generated with a different name, ctwice_root_price, to retain the earlier one.
The egen (extended generate) command is similar to gen but offers additional functions. For example, to create a new variable holding the average price, we use the following command:
egen avg_price = mean(price)
To create a variable holding the 99th percentile of price, we have to enter
egen high_price = pctile(price), p(99)
In data analysis, it is often necessary to recode specific values to focus on particular comparisons. For instance, when analyzing NSS data that include four social groups, one might choose to compare the Scheduled Tribes (ST) with all the other social groups by creating a dummy variable.
gen D_ST=(social_group==1) // assuming that social group is defined by the variable social_group and ST is coded as 1
To group ages ranging between 15 and 65 years into 10-year age bands, the age variable can be recoded with the generate or egen command.
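One possible way to build such age bands is sketched below. The variable names age and age_group are assumptions made for illustration, since the text does not name them, and each band is labelled by its lower bound. With this coding the single age 65 forms its own band.

```stata
* Assumption: the age variable is called age and lies between 15 and 65
* floor() rounds down, so ages 15-24 map to 15, ages 25-34 map to 25, and so on
gen age_group = 15 + 10*floor((age - 15)/10) if !missing(age)

* Check the resulting bands with a frequency table
tab age_group
```

Labelling each band by its lower bound keeps the new variable numeric and ordered, which is convenient if it is later used with bysort or as a set of dummy variables.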
Describing Data
Use the sum command (short for summarize), which summarises all the variables.
To summarise a particular variable, price, for example, we have to use the following command:
sum price, detail
To see how consumption changes as temperature changes:
bysort temp: sum consumption
The tab command displays the frequency distribution of a variable, while the tabstat command calculates and presents summary statistics for continuous variables such as price, reporting the mean by default.
For the median:
tabstat price, stats(median)
For the variance:
tabstat price, stats(variance)
The tabout command exports tabulations in a presentable format. It is a user-written command rather than part of official Stata, so it may need to be installed first (for example, with ssc install tabout). To create an Excel table displaying the numbers for each county, use the command
tabout county using filename.xls, replace
The replace option overwrites an existing file, while the append option adds new results to the same file. For example, the command
tabout county using filename.xls, append cells(freq co)
appends the results to the specified Excel file.
Graphs
Stata has excellent graphics facilities, accessible through the graph command. The Graph Editor can be used to modify a graph interactively.
For continuous data, the easiest way to visualise the relationship between two variables (e.g. consumption (cons) and price) is to produce a scatterplot of them.
To produce a simple scatterplot of consumption against price, use the command
graph twoway scatter cons price
If we want the best-fitted line for the two variables, the relevant command is
graph twoway lfit cons price
To combine multiple plots in a single graph, we have to use the twoway command.
twoway (scatter cons price) (lfit cons price)
We can add confidence interval bands around the line of best fit by using the command
graph twoway (lfitci cons price) ///
    (scatter cons price)
To label the points with the values of a variable, we have to use the mlabel(varname) option:
graph twoway (lfitci cons price) ///
    (scatter cons price, mlabel(social_group))
We can also include titles, labels and legends in a two-way graph of cons against price by using the following command, in which mlabv(pos) takes the position of each marker label from a variable named pos.
graph twoway (lfitci cons price) ///
    (scatter cons price, mlabel(social_group) mlabv(pos)) ///
    , title("Price Consumption Relationship") ///
    ytitle("Consumption") ///
    legend(ring(0) pos(5) order(2 "linear fit" 1 "95% CI"))
graph export fig31.png, width(500) replace
A histogram can be drawn by using the histogram command. To draw a histogram of the price distribution, we use
histogram price
We can draw a bar diagram for cons and price by using
graph bar cons price
Logical Operators in Stata
In Stata, the arithmetic operators are addition (+), subtraction (-), multiplication (*), division (/) and power (^). The logical operators are 'and' (&), 'or' (|) and 'not' (!). The relational operators are equal (==), not equal (!=), less than (<), greater than (>), less than or equal (<=) and greater than or equal (>=).
Functions Used in Stata
Stata has a large number of functions; the following are the frequently used mathematical functions:
int(x): integer obtained by truncating x towards zero
ln(x) or log(x): natural logarithm of x if x > 0
log10(x): log base 10 of x (for x > 0)
logit(x): log of the odds for probability x, logit(x) = ln(x/(1 − x))
max(x1, x2, …, xn): maximum of x1, x2, …, xn, ignoring missing values
min(x1, x2, …, xn): minimum of x1, x2, …, xn, ignoring missing values
round(x): x rounded to the nearest whole number
sqrt(x): square root of x if x ≥ 0
Stata also has functions to generate random numbers, which are useful in simulation.
It offers a comprehensive array of functions for probability distributions and their inverses, including normal() for the cumulative distribution function (CDF) of the standard normal distribution and invnormal() for its inverse. It also allows observations to be simulated from a normal distribution.
To see a complete list of functions, type help mathfun
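The following is a minimal sketch of how these functions might be combined in a small simulation. The number of observations, the seed and the variable names z, p and z_back are arbitrary choices made for this illustration.

```stata
* Illustrative simulation; the seed and the variable names are arbitrary
clear
set obs 1000
set seed 12345

* 1000 draws from the standard normal distribution
gen z = rnormal()

* normal() gives the CDF value of each draw; invnormal() maps it back to the quantile
gen p = normal(z)
gen z_back = invnormal(p)

* z_back should reproduce z up to rounding error
sum z p z_back
```

Setting the seed makes the draws reproducible, so the same simulated sample is obtained each time the commands are run.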
Matrix Algebra
Matrix and Vector: Basic Operations
A matrix is a rectangular or square array of numbers or variables arranged in rows and columns. We use a capital letter to denote a matrix and a small letter to denote a vector.
The matrix \(A\) can be expressed as \(A = (a_{ij})\), where \(a_{ij}\) is the element in row \(i\) and column \(j\).
A vector is a matrix with a single column or row.
If a matrix contains zeros in all off-diagonal positions, it is said to be a diagonal matrix.
A diagonal matrix with a 1 in each diagonal position is called an identity matrix and is denoted by \(I\).
An upper triangular matrix is a square matrix with zeros below the diagonal. A lower triangular matrix is defined similarly, with zeros above the diagonal.
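If \(a\) and \(b\) are both \(n \times 1\), then the sum of products
\[ a'b = a_1b_1 + a_2b_2 + \cdots + a_nb_n \]
is a scalar.
On the other hand, \(ab'\) is defined for any size \(a\) and \(b\) and is a matrix, either rectangular or square. If \(a\) is \(n \times 1\) and \(b\) is \(p \times 1\),
\[ ab' = \begin{pmatrix} a_1b_1 & a_1b_2 & \cdots & a_1b_p \\ a_2b_1 & a_2b_2 & \cdots & a_2b_p \\ \vdots & \vdots & & \vdots \\ a_nb_1 & a_nb_2 & \cdots & a_nb_p \end{pmatrix} \]
The product \(a'a = a_1^2 + a_2^2 + \cdots + a_n^2\) is a sum of squares, while \(aa'\) is a symmetric square matrix. The products \(a'a\) and \(aa'\) are known as the dot product and matrix product, respectively. Additionally, the square root of the sum of squares of the elements of \(a\), \(\sqrt{a'a}\), is the distance from the origin to the point represented by \(a\), commonly referred to as the length of the vector.
When \(j\) is the \(n \times 1\) vector of ones and \(J\) is the \(n \times n\) matrix of ones, then
\[ j'j = n \tag{1.8.5} \]
\[ jj' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = J \]
Thus, \(a'j\) is the sum of the elements in \(a\), \(j'A\) contains the column sums of \(A\), and \(Aj\) contains the row sums of \(A\).
Since \(a'b\) is a scalar, it is equal to its transpose:
\[ a'b = (a'b)' = b'a \]
This allows us to write
\[ (a'b)^2 = (a'b)(a'b) = (a'b)(b'a) = a'(bb')a \]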
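As a small numerical check of these vector products, consider two vectors chosen here purely for illustration (they do not come from the original text):

```latex
\[
a = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
b = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \qquad
j = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\]
\[
a'b = 1\cdot 3 + 2\cdot 4 = 11, \qquad
ab' = \begin{pmatrix} 3 & 4 \\ 6 & 8 \end{pmatrix}, \qquad
a'j = 1 + 2 = 3
\]
\[
a'a = 1^2 + 2^2 = 5, \qquad
\sqrt{a'a} = \sqrt{5}, \qquad
aa' = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}
\]
\[
(a'b)^2 = 121, \qquad
a'(bb')a =
\begin{pmatrix} 1 & 2 \end{pmatrix}
\begin{pmatrix} 9 & 12 \\ 12 & 16 \end{pmatrix}
\begin{pmatrix} 1 \\ 2 \end{pmatrix}
= 33 + 88 = 121
\]
```

The last line confirms that squaring the scalar \(a'b\) gives the same value as the quadratic expression \(a'(bb')a\).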
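We can express matrix multiplication in terms of row vectors and column vectors.
If \(a_i'\) is the \(i\)th row of \(A\) and \(b_j\) is the \(j\)th column of \(B\), then the \((i,j)\)th element of \(AB\) is \(a_i'b_j\).
For example, if \(A\) has three rows and \(B\) has two columns,
\[ AB = \begin{pmatrix} a_1'b_1 & a_1'b_2 \\ a_2'b_1 & a_2'b_2 \\ a_3'b_1 & a_3'b_2 \end{pmatrix} \]
Let \(A\) be a \(2 \times p\) matrix with rows \(a_1'\) and \(a_2'\), \(x\) be a \(p \times 1\) vector, and \(S\) be a \(p \times p\) matrix. Then
\[ Ax = \begin{pmatrix} a_1'x \\ a_2'x \end{pmatrix}, \qquad ASA' = \begin{pmatrix} a_1'Sa_1 & a_1'Sa_2 \\ a_2'Sa_1 & a_2'Sa_2 \end{pmatrix} \]
Similarly, if we express matrix \(A\) in terms of its columns as \(A = (a_1, a_2, \ldots, a_p)\), then the \((i,j)\)th element of \(A'A\) is \(a_i'a_j\).
Multiplying a matrix \(A\) by a diagonal matrix \(D\) rescales the rows of \(A\) (in \(DA\)) or the columns of \(A\) (in \(AD\)) by the diagonal elements of \(D\). If the diagonal matrix is the identity, we have
\[ IA = AI = A \]
In the case of a rectangular matrix \(A\), the two identity matrices differ in size, yet the result still holds. Multiplying a matrix by a scalar multiplies each element of the matrix by the scalar: \(cA = (c\,a_{ij})\).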
Multiplication of vectors or matrices by scalars permits the use of linear combinations, such as
\[ \sum_{i=1}^{n} a_i x_i = a_1x_1 + a_2x_2 + \cdots + a_nx_n \tag{1.8.19} \]
\[ \sum_{i=1}^{n} a_i B_i = a_1B_1 + a_2B_2 + \cdots + a_nB_n \tag{1.8.20} \]
If \(A\) is a symmetric matrix and \(x\) and \(y\) are vectors, the product
\[ x'Ax = \sum_{i} a_{ii}x_i^2 + \sum_{i \neq j} a_{ij}x_ix_j \tag{1.8.21} \]
is called a quadratic form, whereas
\[ x'Ay = \sum_{i,j} a_{ij}x_iy_j \tag{1.8.22} \]
is called a bilinear form.
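To see how the quadratic and bilinear forms expand term by term, here is a small worked example. The matrix and vectors are illustrative choices, not taken from the original text.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad
x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad
y = \begin{pmatrix} 1 \\ 0 \end{pmatrix}
\]
\[
x'Ax = a_{11}x_1^2 + a_{22}x_2^2 + a_{12}x_1x_2 + a_{21}x_2x_1
     = 2(1) + 3(4) + 1(2) + 1(2) = 18
\]
\[
x'Ay = a_{11}x_1y_1 + a_{12}x_1y_2 + a_{21}x_2y_1 + a_{22}x_2y_2
     = 2 + 0 + 2 + 0 = 4
\]
```

The same values follow from direct multiplication: \(Ax = (4, 7)'\) gives \(x'Ax = 4 + 14 = 18\), and \(Ay = (2, 1)'\) gives \(x'Ay = 2 + 2 = 4\).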
Partitioned Matrices
It is sometimes convenient to partition a matrix into submatrices. For example, a partitioning of a matrix \(A\) into four submatrices could be indicated symbolically as follows:
\[ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \]
Suppose \(A_{11}\) and \(A_{22}\) are square and nonsingular (not necessarily the same size). Then the determinant of \(A\) is given by either of the following two expressions:
\[ |A| = |A_{11}|\,\left|A_{22} - A_{21}A_{11}^{-1}A_{12}\right| = |A_{22}|\,\left|A_{11} - A_{12}A_{22}^{-1}A_{21}\right| \]
When two matrices \(A\) and \(B\) are conformable and partitioned into appropriately conformable submatrices, their product \(AB\) can be computed using the standard row-by-column multiplication method, treating the submatrices as single elements. Matrix-vector multiplication can also be performed in partitioned form.
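The determinant formulas for a partitioned matrix can be checked on the simplest possible case, where every block is a single number. The matrix below is an illustrative choice, not from the original text.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad
A_{11} = 2, \quad A_{12} = 1, \quad A_{21} = 1, \quad A_{22} = 3
\]
\[
|A| = 2\cdot 3 - 1\cdot 1 = 5
\]
\[
|A_{11}|\,\left|A_{22} - A_{21}A_{11}^{-1}A_{12}\right|
  = 2\left(3 - \tfrac{1}{2}\right) = 5, \qquad
|A_{22}|\,\left|A_{11} - A_{12}A_{22}^{-1}A_{21}\right|
  = 3\left(2 - \tfrac{1}{3}\right) = 5
\]
```

Both partitioned expressions reproduce the ordinary determinant, as expected.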
Rank of a Matrix
A set of vectors \(a_1, a_2, \ldots, a_n\) is said to be linearly dependent if constants \(c_1, c_2, \ldots, c_n\) (not all zero) can be found such that
\[ c_1a_1 + c_2a_2 + \cdots + c_na_n = 0 \tag{1.8.25} \]
If no constants \(c_1, c_2, \ldots, c_n\), not all zero, can be found satisfying (1.8.25), the set of vectors is said to be linearly independent.
If the condition (1.8.25) is satisfied, at least one vector \(a_i\) can be represented as a linear combination of the other vectors in the set, so a linearly dependent set contains redundancy. In contrast, linearly independent vectors do not exhibit this type of redundancy.
The rank of any square or rectangular matrix \(A\) is defined as
\[ \operatorname{rank}(A) = \text{number of linearly independent rows of } A = \text{number of linearly independent columns of } A \]
The rank of a matrix \(A\) is thus the size of the largest set of linearly independent columns (column rank) or rows (row rank). The column rank is the dimension of the column space of \(A\), and the row rank is the dimension of the row space of \(A\).
The number of linearly independent rows of a matrix always equals the number of linearly independent columns, so the column rank equals the row rank for any matrix.
If \(A\) is \(n \times p\), the maximum possible rank of \(A\) is the smaller of \(n\) and \(p\), in which case \(A\) is said to be of full rank.
The matrix
\[ A = \begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix} \]
has a rank of 2 because its two rows are linearly independent: neither row is a multiple of the other. Since the rank is 2, only two of its three columns are linearly independent, so the columns are linearly dependent. Consequently, there exist constants \(c_1\), \(c_2\) and \(c_3\), not all zero, such that
\[ c_1\begin{pmatrix} 1 \\ 5 \end{pmatrix} + c_2\begin{pmatrix} -2 \\ 2 \end{pmatrix} + c_3\begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{1.8.27} \]
A solution vector to (1.8.27) is any multiple of \(c = (14, -11, -12)'\). This leads to the notable conclusion that the product \(Ac\) equals zero even though neither \(A\) nor \(c\) is zero. This arises from the linear dependence of the column vectors of \(A\).
Another consequence of the linear dependence of rows or columns of a matrix is the possibility of expressions such as \(AB = CB\), where \(A \neq C\). This can occur even when all three matrices \(A\), \(B\) and \(C\) are of full rank: being rectangular, they have a rank deficiency in either rows or columns, which permits us to construct \(AB = CB\) with \(A \neq C\). Thus, in a matrix equation, we cannot, in general, cancel matrices from both sides of the equation.
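A concrete illustration of \(AB = CB\) with \(A \neq C\) is sketched below. These particular matrices are chosen here for illustration and are not necessarily those of the original example.

```latex
\[
A = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 0 & -1 \end{pmatrix}, \qquad
B = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
C = \begin{pmatrix} 2 & 1 & 1 \\ 5 & -6 & -4 \end{pmatrix}
\]
\[
AB = \begin{pmatrix} 1 + 0 + 2 & 2 + 3 + 0 \\ 2 + 0 - 1 & 4 + 0 + 0 \end{pmatrix}
   = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix}, \qquad
CB = \begin{pmatrix} 2 + 0 + 1 & 4 + 1 + 0 \\ 5 + 0 - 4 & 10 - 6 + 0 \end{pmatrix}
   = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix}
\]
```

Each of the three matrices has rank 2, the maximum possible for its dimensions, yet \(AB = CB\) while \(A \neq C\), so \(B\) cannot be cancelled.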
Inverse Matrix
If a matrix \(A\) is square and of full rank, then \(A\) is said to be nonsingular, and \(A\) has a unique inverse, denoted by \(A^{-1}\), with the property that
\[ AA^{-1} = A^{-1}A = I \]
If \(A\) is nonsingular, its determinant is nonzero.
A square matrix \(A\) is singular if it is of less than full rank, in which case an inverse does not exist and the determinant of \(A\) is equal to 0. Rectangular matrices, whether or not they are of full rank, do not possess inverses.
If a matrix is nonsingular, it can be cancelled from both sides of an equation, provided it appears on the same side (left or right) of both. For example, if \(B\) is nonsingular, then
\[ AB = CB \quad \text{implies} \quad A = C \]
If \(A\) and \(B\) are the same size and nonsingular, then the inverse of their product is the product of their inverses in reverse order,
\[ (AB)^{-1} = B^{-1}A^{-1} \]
The inverse of the transpose of a nonsingular matrix is given by the transpose of the inverse, \((A')^{-1} = (A^{-1})'\).
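The inverse properties can be verified on a small matrix. The example below uses an illustrative \(2 \times 2\) matrix with determinant 1, chosen so that the arithmetic stays in whole numbers.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 5 & 3 \end{pmatrix}, \qquad
|A| = 2\cdot 3 - 1\cdot 5 = 1, \qquad
A^{-1} = \begin{pmatrix} 3 & -1 \\ -5 & 2 \end{pmatrix}
\]
\[
AA^{-1} = \begin{pmatrix} 6 - 5 & -2 + 2 \\ 15 - 15 & -5 + 6 \end{pmatrix}
        = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]
\[
A' = \begin{pmatrix} 2 & 5 \\ 1 & 3 \end{pmatrix}, \qquad
(A')^{-1} = \begin{pmatrix} 3 & -5 \\ -1 & 2 \end{pmatrix} = (A^{-1})'
\]
```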
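A symmetric matrix \(A\) is said to be positive definite if \(x'Ax > 0\) for all possible nonzero vectors \(x\).
Similarly, \(A\) is positive semi-definite if \(x'Ax \geq 0\) for all \(x \neq 0\).
The diagonal elements \(a_{ii}\) of a positive definite matrix are positive. If \(A\) is positive definite, its determinant is positive.
One way to obtain a positive definite matrix is as follows:
If \(A = B'B\), where \(B\) is \(n \times p\) of rank \(p < n\), then \(A\) is positive definite, since for any nonzero \(x\),
\[ x'Ax = x'B'Bx = (Bx)'(Bx) = z'z > 0 \tag{1.8.29} \]
where \(z = Bx\) is nonzero because the columns of \(B\) are linearly independent.
A positive definite matrix \(A\) can be factored in the following way:
\[ A = T'T \]
where \(T\) is a nonsingular upper triangular matrix.
One way to obtain \(T\) is the Cholesky decomposition.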
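As a sketch of what such a factorisation looks like, consider the following illustrative positive definite matrix (not from the original text) and its upper triangular factor.

```latex
\[
A = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}, \qquad
T = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
\]
\[
T'T = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
      \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}
    = \begin{pmatrix} 4 & 2 \\ 2 & 1 + 4 \end{pmatrix} = A
\]
\[
x'Ax = 4x_1^2 + 4x_1x_2 + 5x_2^2 = (2x_1 + x_2)^2 + (2x_2)^2 = (Tx)'(Tx) > 0
\quad \text{for } x \neq 0
\]
```

The last line shows why the factorisation implies positive definiteness: the quadratic form is a sum of squares of the elements of \(Tx\), which is nonzero whenever \(x\) is nonzero because \(T\) is nonsingular.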
Trace of a Matrix
A simple function of an \(n \times n\) matrix \(A\) is the trace, denoted by \(\operatorname{tr}(A)\) and defined as the sum of the diagonal elements of \(A\):
\[ \operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii} \tag{1.8.30} \]
The trace of a matrix is a scalar.
The trace of the sum of two square matrices is the sum of the traces of the two matrices:
\[ \operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B) \tag{1.8.31} \]
An important result for the product of two matrices is
\[ \operatorname{tr}(AB) = \operatorname{tr}(BA) \tag{1.8.32} \]
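Both trace results can be checked on a pair of small matrices, chosen here purely for illustration:

```latex
\[
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad
B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\operatorname{tr}(A) = 5, \quad \operatorname{tr}(B) = 0, \quad
\operatorname{tr}(A + B) = 1 + 4 = 5
\]
\[
AB = \begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix}, \qquad
BA = \begin{pmatrix} 3 & 4 \\ 1 & 2 \end{pmatrix}, \qquad
\operatorname{tr}(AB) = 2 + 3 = 5 = 3 + 2 = \operatorname{tr}(BA)
\]
```

Note that \(AB \neq BA\) here, yet the two products have the same trace.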
Orthogonal Vectors and Matrices
Two vectors \(a\) and \(b\) of the same size are said to be orthogonal if
\[ a'b = a_1b_1 + a_2b_2 + \cdots + a_nb_n = 0 \tag{1.8.33} \]
Geometrically, orthogonal vectors are perpendicular.
If \(a'a = 1\), the vector \(a\) is said to be normalised. The vector \(a\) can always be normalised by dividing by its length \(\sqrt{a'a}\).
Thus,
\[ c = \frac{a}{\sqrt{a'a}} \]
is normalised so that \(c'c = 1\).
A square matrix \(C = (c_1, c_2, \ldots, c_p)\) whose columns are normalised and mutually orthogonal is called an orthogonal matrix, and it satisfies \(C'C = I\).
Multiplication by an orthogonal matrix has the effect of rotating axes; that is, if a point \(x\) is transformed to \(z = Cx\), where \(C\) is orthogonal, then
\[ z'z = (Cx)'(Cx) = x'C'Cx = x'Ix = x'x \tag{1.8.35} \]
In this case, the distance from the origin to \(z\) is the same as the distance to \(x\).
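The distance-preserving property can be illustrated with a rotation through 90 degrees. The matrix and the point below are illustrative choices, not from the original text.

```latex
\[
C = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad
C'C = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}
      \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
    = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\]
\[
x = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \qquad
z = Cx = \begin{pmatrix} -4 \\ 3 \end{pmatrix}, \qquad
z'z = 16 + 9 = 25 = 9 + 16 = x'x
\]
```

Both \(x\) and \(z\) lie at distance 5 from the origin; only the direction has changed.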
Eigenvalues and Eigenvectors
For every square matrix \(A\), a scalar \(\lambda\) and a nonzero vector \(x\) can be found such that
\[ Ax = \lambda x \tag{1.8.36} \]
In (1.8.36), \(\lambda\) is called an eigenvalue of \(A\), and \(x\) is an eigenvector of \(A\) corresponding to \(\lambda\). To find \(\lambda\) and \(x\), we write (1.8.36) as
\[ (A - \lambda I)x = 0 \tag{1.8.37} \]
If \(|A - \lambda I| \neq 0\), then \((A - \lambda I)\) has an inverse and \(x = 0\) is the only solution for (1.8.37).
Hence, in order to obtain nontrivial solutions, we set
\[ |A - \lambda I| = 0 \tag{1.8.38} \]
to find values of \(\lambda\) that can be substituted into (1.8.37) to find corresponding values of \(x\).
Equivalently, for (1.8.37) to have a non-zero solution vector \(x\), the columns of \(A - \lambda I\) must be linearly dependent, that is, \(A - \lambda I\) must be singular. Equation (1.8.38) is called the characteristic equation. For an \(n \times n\) matrix \(A\), the characteristic equation has \(n\) roots, so \(A\) has \(n\) eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_n\). These eigenvalues need not all be distinct or non-zero.
After finding \(\lambda_1, \lambda_2, \ldots, \lambda_n\), the corresponding eigenvectors \(x_1, x_2, \ldots, x_n\) can be found using (1.8.37).
If we multiply both sides of (1.8.37) by a scalar \(k\), we obtain
\[ k(A - \lambda I)x = (A - \lambda I)kx = 0 \]
Thus, if \(x\) is an eigenvector of \(A\), then \(kx\) is also an eigenvector for any nonzero scalar \(k\). Eigenvectors are unique only up to scalar multiplication: the length of \(x\) can be changed, but its direction from the origin is uniquely defined.
Typically, the eigenvector \(x\) is scaled so that \(x'x = 1\).
If \(\lambda\) is an eigenvalue of \(A\) and \(x\) is the corresponding eigenvector, then \(1 + \lambda\) is an eigenvalue of \(I + A\) and \(1 - \lambda\) is an eigenvalue of \(I - A\).
In either case, \(x\) is the corresponding eigenvector.
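The eigenvectors of an \(n \times n\) symmetric matrix \(A\) are mutually orthogonal. It follows that if the \(n\) eigenvectors of \(A\) are normalised and inserted as columns of a matrix \(C = (x_1, x_2, \ldots, x_n)\), then \(C\) is orthogonal, so that
\[ I = CC' \tag{1.8.41} \]
which we can multiply by \(A\) to obtain
\[ A = ACC' = (Ax_1, Ax_2, \ldots, Ax_n)C' = (\lambda_1x_1, \lambda_2x_2, \ldots, \lambda_nx_n)C' = CDC' \]
where \(D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)\).
The expression \(A = CDC'\) for a symmetric matrix \(A\) in terms of its eigenvalues and eigenvectors is known as the spectral decomposition of \(A\).
Since \(C\) is orthogonal and \(C'C = CC' = I\), we can write
\[ C'AC = C'CDC'C = D \]
Thus, a symmetric matrix \(A\) can be diagonalised by an orthogonal matrix containing normalised eigenvectors of \(A\), and the resulting diagonal matrix contains the eigenvalues of \(A\).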
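The steps above, from the characteristic equation to the spectral decomposition, can be traced on a small symmetric matrix. The matrix is an illustrative choice, not from the original text.

```latex
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
|A - \lambda I| = (2 - \lambda)^2 - 1 = 0
\quad \Longrightarrow \quad \lambda_1 = 3, \ \lambda_2 = 1
\]
\[
x_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
x_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad
x_1'x_2 = 0, \quad x_1'x_1 = x_2'x_2 = 1
\]
\[
C = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad
D = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}
\]
\[
CDC' = \frac{1}{2}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
= \frac{1}{2}\begin{pmatrix} 4 & 2 \\ 2 & 4 \end{pmatrix} = A
\]
```

In line with the result on \(I + A\), the matrix \(I + A\) in this example has eigenvalues 4 and 2 with the same eigenvectors.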
• The unification of economic theory, mathematics and statistics constitutes what is called econometrics.
• Econometric methods are helpful in explaining the stochastic relationship among variables in mathematical format.
• In an econometric model, the dependent variable is called the explained variable and the independent variables are called explanatory variables.
• The random error or disturbance term captures the part of the dependent variable that cannot be predicted by the independent variables in the equation.
• An econometric model is the formulation of an economic model in an empirically testable form.
• An economic model provides a theoretical relation, but an econometric model is a relationship used to analyse real-life situations.
• The choice of variables in the econometric model is determined by economic theory as well as data considerations.
• The error term or disturbance term is perhaps the most important component of any econometric analysis.
• The conditional mean function, known as the population regression function, is the econometric model representing the population. It relates the expected value of the population regressand (y) to the population regressors (x).
• The econometric model based on the sample is called the sample regression function.
• The parametric econometric model is based on prior knowledge of the functional form of the relationship.
• The nonparametric econometric model is specified endogenously on the basis of the data.
• When a regression model is specified in linear form in terms of the logs of the variables, the regression coefficients measure proportional changes.
• The method of moments utilises the moment conditions relating to the zero unconditional and conditional mean of the random errors.
• The least squares principle suggests that we should select the estimators of the parameters so as to minimise the residual sum of squares.
• Maximum likelihood estimators are obtained by maximising the probability of observing the responses.
• Hypothesis testing is a statistical process to test claims or ideas about a population on the basis of a sample drawn from it.
• The null hypothesis (H0) is a statement about a population parameter.
• The alternative hypothesis (H1) is a counterstatement to the null hypothesis, asserting that the true value of a population parameter is less than, greater than, or not equal to the value proposed in the null hypothesis.
• The alternative hypothesis determines in which tail of the sampling distribution the level of significance is placed.
• The rejection region is the region beyond a critical value in a hypothesis test.
• The test statistic is a value obtained by exploiting the nature of the sampling distribution of the statistic.
• A Type I error is the rejection of a null hypothesis that is actually true; its probability is the level of significance.
• A Type II error is the retention of a null hypothesis that is actually false.
• Cross section data are not stochastic in nature.
• The data generating process of time series is stochastic, and the realisation of time series data is characterised by a joint probability density function.
• A time series for each cross section unit forms a set of panel data or longitudinal data.
Linear Regression Model: Properties and Estimation
Regression analysis aims to estimate unknown model parameters, validate the model's functional form against theoretical expectations, and predict future values of the response variable. This chapter focuses on linear regression models and their application to cross-sectional data. Linear regression estimates the conditional expected value of a dependent variable given a set of independent variables. Multiple regression analysis extends this by allowing ceteris paribus analysis, that is, control over the various factors that simultaneously influence the dependent variable. The strength of multiple regression lies in its ability to provide ceteris paribus interpretations even when the data were not collected in a ceteris paribus fashion.
Regression analysis is a fundamental aspect of econometrics. The primary goal of linear regression analysis is to validate a hypothesized model and to use it for prediction. Multiple regression analysis is particularly advantageous because it allows ceteris paribus interpretations, enabling a more nuanced understanding of relationships using sample data that may not have been collected under strictly controlled conditions.
Introduction
Linear regression is a versatile statistical technique widely used across various fields, including economics, the social sciences, the physical sciences and the medical sciences, to generate insights empirically. In economics, it serves to validate empirically the theories and hypotheses proposed by researchers. The method investigates functional relationships among variables, even when those relationships are not exact.
P Das, Econometrics in Theory and Practice, https://doi.org/10.1007/978-981-32-9019-8_2
Such relationships can be predicted, but they are not perfectly deterministic. For instance, as a person's height increases, their weight is likely to rise, but the relationship is not exact. Similarly, higher smoking rates generally go with a higher incidence of lung cancer, but the relationship between the two is not exact either.
A linear regression model establishes a relationship through a linear equation that links a dependent variable, such as the incidence of lung cancer, to one or more independent variables, including smoking rates and various socioeconomic and demographic factors In this context, the response variable is quantified by the number of individuals diagnosed with lung cancer, while the predictor variables encompass the number of smokers and additional demographic information.
Dependent and independent variables can be either scalars or vectors. Simple linear regression involves a single predictor variable, while multiple linear regression incorporates several predictors, making the independent variable a vector. When a regression model includes two or more response variables, it is referred to as multivariate regression. The key difference between simple and multiple regression lies in the number of predictor variables used. Linear regression serves as a method for estimating the conditional expected value of the dependent variable given the values of a set of independent variables.
This chapter explores both simple and multiple regression models, beginning with an overview of the simple linear regression model in Section 2.2. Section 2.3 outlines the fundamental structure of the multiple linear regression model, while Section 2.4 discusses the essential assumptions underlying linear regression. In Section 2.5, we address the challenges associated with estimating linear regression models, and Section 2.6 presents the algebraic and statistical properties of ordinary least squares (OLS) estimation.