NATIONAL ECONOMICS UNIVERSITY
SCHOOL OF ADVANCED EDUCATION PROGRAMS
-***-

STATISTICS
Sampling Distribution and Estimation
(GROUP 1)

Students:
Nguyen Ha Linh 11213233
Nguyen Phuong Anh 11210589
Phi Hanh Nguyen 11214473
Nguyen Phuong Nhung 11214629
Tran Thi Minh Tam 11219369

Ha Noi, May 2023

LIST OF TABLES
Table 1 Baseline characteristics of the patients
Table 2 Changes in coprimary end points and cardiometabolic risk factors between baseline and week 56
Table 3 Characteristics of 6712 participants
Table 4 Adherence to quality indicators, overall and according to type of care and function
Table 5 Adherence to quality indicators, according to mode
Table 6 Adherence to quality indicators, according to conditions*
Table 7 Demographics, alcohol advertisement exposure, and market alcohol advertisement expenditure by mean alcohol use and changes in alcohol use over time
Table 8 Hierarchical linear modeling parameter estimates predicting alcohol use for the total sample
Table 9 Hierarchical linear modeling parameter estimates predicting alcohol use among 15- to 20-year-olds

LIST OF FIGURES
Figure 1 Alcohol use over time by age in markets with high alcohol advertising expenditure per capita
Figure 2 Alcohol use over time by age in markets with low alcohol advertising expenditure per capita
Figure 3 Alcohol use by mean advertising exposure, market advertising expenditure per capita, and gender
Figure 4 Biology scores (2021)
Figure 5 Physics scores (2021)
Figure 6 Sampling distribution of $\bar{X}$ of Physics scores [n = 40, 100 samples]
Figure 7 Sampling distribution of $\bar{X}$ of Physics scores [n = 100, 100 samples]
Figure 8 Sampling distribution of $\bar{X}$ of Biology scores [n = 40, 100 samples]
Figure 9 Sampling distribution of $\bar{X}$ of Biology scores [n = 100, 100 samples]
Figure 10 Sampling distribution of $\bar{X}$ of Biology scores [n = 300, 100 samples]
Figure 11 Biology scores of 40 random students

TABLE OF CONTENTS
PART 1: INTRODUCTION
I SAMPLING DISTRIBUTION
1 Sampling distribution of a mean
  1.1 Sampling distribution of $\bar{X}$
  1.2 The central limit theorem
  1.3 Sampling distribution of the mean of any population
2 Sampling distribution of a proportion
3 Sampling distribution of the difference between means
4 T-distribution
II ESTIMATION
1 Concepts of Estimation
  1.1 Point and interval estimators
  1.2 Desirable qualities of estimators
2 Confidence interval estimator of μ
  2.1 General information
  2.2 The width of the interval
3 Application of Estimation
  3.1 Financial analysis
  3.2 Quality control
  3.3 Medical research
PART 2: ARTICLE SUMMARY
I MAIN ARTICLE
1 Background
2 Purpose
3 Methods
4 Results
  4.1 Trial population
  4.2 Body weight
  4.3 Glycemic control
  4.4 Cardiometabolic variables
5 Conclusion
II SUB-ARTICLES
1 The Quality of Health Care Delivered to Adults in the United States
  1.1 Background
  1.2 Purpose
  1.3 Methods
  1.4 Results
  1.5 Conclusion
2 Effects of Alcohol Advertising Exposure on Drinking Among Youth
  2.1 Background
  2.2 Purpose
  2.3 Methods
  2.4 Result
  2.5 Conclusion
PART 3: DATA ANALYSIS
I DATASET
1 Source
2 Descriptive information
3 Reasons for choosing the dataset
II THEORY APPLICATION
1 Central Limit Theorem
2 Using sampling distribution for Inference
CONCLUSION
REFERENCES

PART 1: INTRODUCTION
I SAMPLING DISTRIBUTION
- The distribution formed by all the possible values of a sample statistic, obtained from every possible different sample of a given size, is called the sampling distribution.
- Two ways to create a sampling distribution:
  + Draw samples of the same size from a population, calculate the statistic of interest, and then use descriptive techniques.
  + Use the rules of probability and the laws of expected value and variance to derive the sampling distribution.
- The primary function of the sampling distribution is statistical inference.

1 Sampling distribution of a mean
1.1 Sampling distribution of $\bar{X}$
- For each value of $n$, the mean of the sampling distribution of $\bar{X}$ is the mean of the
population from which we are sampling: $\mu_{\bar{x}} = \mu$
- The variance of the sampling distribution of the sample mean is the variance of the population divided by the sample size: $\sigma^2_{\bar{x}} = \sigma^2 / n$
- The standard deviation of the sampling distribution is called the standard error of the mean; that is, $\sigma_{\bar{x}} = \sigma / \sqrt{n}$

1.2 The central limit theorem
- As $n$ gets larger, the sampling distribution of $\bar{X}$ becomes increasingly bell shaped.
- The Central Limit Theorem states: "The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of $\bar{X}$ will resemble a normal distribution."

1.3 Sampling distribution of the mean of any population
For an infinitely large population:
- The mean of the sampling distribution is always equal to the mean of the population: $\mu_{\bar{x}} = \mu$
- The standard error is equal to $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
For a finite population:
- The standard error is $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
Notation:
- $N$: the population size
- $\sqrt{\frac{N - n}{N - 1}}$: the finite population correction factor

2 Sampling distribution of a proportion
- $\hat{P}$ is approximately normally distributed provided that $np$ and $n(1 - p)$ are greater than or equal to 5.
- The standard deviation of $\hat{P}$, $\sigma_{\hat{p}} = \sqrt{p(1 - p)/n}$, is called the standard error of the proportion.

3 Sampling distribution of the difference between means
- The sampling plan calls for independent random samples drawn from each of two normal populations.
- It can be shown mathematically that the difference between two independent normal random variables is also normally distributed. Thus, the difference between two sample means, $\bar{X}_1 - \bar{X}_2$, is normally distributed if both populations are normal.
- By using the laws of expected value and variance, we derive the expected value and variance of the sampling distribution of $\bar{X}_1 - \bar{X}_2$: $\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2$ and $\sigma^2_{\bar{x}_1 - \bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
- The sampling distribution of $\bar{X}_1 - \bar{X}_2$ is normal with mean $\mu_1 - \mu_2$ and standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$ (which is the standard error of the difference between two means).

4 T-distribution
General information
- The t-distribution resembles the normal distribution and is used when the population standard deviation is unknown and must be estimated from the sample, which matters most for small sample sizes.
- It
is symmetric around 0 and mound-shaped (like the normal), but has a higher variance than the standard normal distribution.
- The higher the degrees of freedom, the more closely the curve resembles the normal.
Notation: the degrees of freedom (df $= n - 1$) is the number of observations whose values are free to vary after the sample mean has been calculated.
How to use t-tables
- The bottom row has df = ∞; these are the standard normal probabilities.
- If df is very large, use the Z tables even if $\sigma$ is unknown.
- If the exact df is not in the table, use the closest value.

II ESTIMATION
1 Concepts of Estimation
The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic. For example, the sample mean, which is employed to estimate the population mean, is referred to as the estimator of the population mean. Once the sample mean has been computed, its value is called the estimate.

1.1 Point and interval estimators
Point estimator
A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point.
Drawbacks of using point estimators:
- The estimate is virtually certain to be wrong, since a single value will almost never equal the parameter exactly.
- We often need to know how close the estimator is to the parameter.
- Point estimators cannot reflect the effects of larger sample sizes.
Interval estimator
An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval. The interval estimator is affected by the sample size.
- Applications of estimation:
  + Estimate the proportion of television viewers who are tuned in to a network
  + Estimate the mean income of university graduates

1.2 Desirable qualities of estimators
Unbiased estimator
One desirable quality of an estimator is unbiasedness. An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter.
We want our estimators to be accurate and precise:
- Accurate: On average, our estimator is centered on the true value
- Precise: Our estimates
are close together
Consistency
Another desirable quality is consistency. An unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
Relative efficiency
A third desirable quality is relative efficiency. If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively efficient.
Error of estimation
The sampling error is defined as the difference between an estimator and a parameter. We can also define this difference as the error of estimation.

2 Confidence interval estimator of μ
2.1 General information
The confidence interval estimator is a probability statement about the sample mean. It states that there is $1 - \alpha$ probability that the sample mean will be equal to a value such that the interval $\bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n}$ to $\bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n}$ will include the population mean.
In general, a confidence interval estimator for μ is given by $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
Notation:
o The probability $1 - \alpha$ is called the confidence level
o $\bar{x} - z_{\alpha/2}\,\sigma/\sqrt{n}$ is called the lower confidence limit (LCL)
o $\bar{x} + z_{\alpha/2}\,\sigma/\sqrt{n}$ is called the upper confidence limit (UCL)
A 95% confidence interval should be interpreted as saying: "In repeated sampling, 95% of such intervals created would contain the true population mean."

2.2 The width of the interval
The width of the confidence interval estimate is a function of the population standard deviation, the confidence level, and the sample size.
- Factors that influence the width of the interval:
  + Sample size: as the sample size gets bigger, the interval gets narrower.
  + Confidence level: decreasing the confidence level narrows the interval; increasing it widens the interval.

3 Application of Estimation
3.1 Financial analysis
- Estimation techniques are used to:
  + Estimate the value of financial assets
  + Estimate the risk of financial assets to develop risk management strategies
  + Optimize portfolios by selecting a combination of assets that maximizes the expected return for a given level of risk
3.2 Quality control
- Estimation
techniques are used to:
  + Estimate the proportion of defective items in a batch or production run. This information is used to determine whether the batch or run meets the quality standards or needs to be rejected.
  + Estimate the process capability index, which is a measure of the ability of a manufacturing process to produce products within specification limits. This information is used to determine whether the process is capable of producing products within the required quality standards.
3.3 Medical research
- Estimation techniques are used to estimate treatment effects and to determine whether a medical treatment is effective.
- Estimation techniques such as point estimation and interval estimation are used to estimate the prevalence and incidence of diseases in a population and to determine whether they are statistically significant.

In addition, the results show the unit-specific models and the event rate ratios. The event rate ratio can be interpreted as the percentage change in the dependent variable associated with an increase of one unit in the independent variable, holding other factors constant (for example, a ratio of 1.01 corresponds to a 1% increase). Control variables included age, gender, ethnicity, high school or college enrollment, and alcohol sales.

2.4 Result
Youth who saw more alcohol advertisements on average drank more (each additional advertisement seen increased the number of drinks consumed by 1% [event rate ratio, 1.01; 95% CI, 1.01-1.02]). Youth in markets with greater alcohol advertising expenditures drank more (each additional dollar spent per capita raised the number of drinks consumed by 3% [event rate ratio, 1.03; 95% CI, 1.01-1.05]). Examining only youth younger than the legal drinking age of 21 years, alcohol advertising exposure and expenditures were still related to drinking. Youth in markets with more alcohol advertisements showed increases in drinking levels into their late 20s, but drinking plateaued in the early 20s
for youth in markets with fewer advertisements.

2.4.1 Level 1: Differences within individuals over time in advertising exposure
Alcohol advertising exposure at level 1 was centered on the individual's mean alcohol advertising exposure across all observations.

Table 7 Demographics, alcohol advertisement exposure, and market alcohol advertisement expenditure by mean alcohol use and changes in alcohol use over time

Table 7 shows that 61% of the sample had at least one drink in the past month at baseline. Drinkers consumed 38.5 total drinks on average in the past month at baseline (95% CI, 34.3-42.7), imbibing an average of 4.5 drinks per episode (95% CI, 4.3-4.8). Drinkers younger than 21 years had 29 drinks on average at baseline, with 4.5 drinks on average each drinking session (95% CI, 4.1-4.8). Individuals reported seeing an average of 22.7 alcohol advertisements per month at baseline.

2.4.2 Level 2: Differences between individuals in advertising exposure
The individual's mean advertising exposure was added as an independent variable at level 2. The results in Table 8 show that advertising exposure was positively related to an increase in drinking. Holding other factors constant, individuals who saw more advertisements on average than other individuals had 1% more alcoholic drinks per month (event rate ratio, 1.01; 95% CI, 1.01-1.02). Within-individual variation in advertising exposure was not a statistically significant factor in drinking, so whether a youth saw more or fewer advertisements in a particular month than he or she typically saw was not as important a determinant of drinking as that person's average level of advertising exposure over time.

Table 8 Hierarchical linear modeling parameter estimates predicting alcohol use for the total sample

2.4.3 Level 3: Market-level advertising expenditures
Advertising spending in a market is linked to drinking levels and growth in
drinking over time. Spending one extra dollar per capita on advertising in the market increases alcoholic beverage consumption by 3% per month (event rate ratio, 1.03; 95% CI, 1.01-1.05), holding constant other factors, including time. This effect is stronger among older youth in markets with high advertising expenditures per capita, where 25-year-olds consume close to 50 drinks per month (Figure 1). In markets with low advertising expenditures per capita, initial drinking rates are lower than in markets with high advertising expenditures per capita (Figure 2). Younger age groups there show a slower increase in drinking over time than peers in markets with high advertising expenditures per capita. Drinking growth flattens out around the age of 22, with little increase thereafter. Above the age of 23, drinking declines over time in markets with low advertising expenditures per capita, declining most steeply in the older age groups.

Figure 1 Alcohol use over time by age in markets with high alcohol advertising expenditure per capita
Figure 2 Alcohol use over time by age in markets with low alcohol advertising expenditure per capita

To better illustrate the effects of the main variables of interest, Figure 3 depicts the relationship among alcohol use, mean levels of advertising exposure, advertising expenditures per capita, and gender.

Figure 3 Alcohol use by mean advertising exposure, market advertising expenditure per capita, and gender

The study also found that underage youth who reported higher levels of alcohol advertising exposure drank more (Table 9). Each additional advertisement exposure increased the number of drinks consumed in the past month by 1% (event rate ratio, 1.01; 95% CI, 1.001-1.021), holding constant other factors. Drinking levels were also higher among underage youth living in markets with greater advertising expenditures (event rate ratio, 1.03; 95% CI, 1.00-1.06), holding constant other factors. A 3-way interaction effect occurred among time, age, and market
advertising expenditures, following similar growth curves to those in Figure 1 and Figure 2.

Table 9 Hierarchical linear modeling parameter estimates predicting alcohol use among 15- to 20-year-olds

2.5 Conclusion
Alcohol advertising contributes to increased drinking among youth. The results of the study provide evidence that the amount of advertising expenditures in 15- to 26-year-olds' media environment and the amount of advertising recalled are related to greater youth drinking. Youth younger than the legal drinking age displayed a similar pattern of advertising effects as the entire age range. This is important because there is often a greater policy interest in protecting underage youth from harmful communications than in protecting youth older than 21.

PART 3: DATA ANALYSIS
I DATASET
1 Source
The dataset of the 2021 university entrance scores was retrieved from GitHub: https://github.com/khoingo123/diem-thi-dai-hoc-2021
Although scores are available for all subjects, the Biology and Physics scores were selected for analysis and for applying statistical theories, with the former being our main data. Because the data come from GitHub, the dataset is slightly smaller than the actual population reported in the government statistical report. However, since it covers up to 95% of the official data, our research team decided to treat this dataset as the whole population.

2 Descriptive information
                     N        Minimum   Maximum   Mean      Std. Deviation   Skewness (Statistic)   Skewness (Std. Error)
Biology scores       326369   .000      9.750     5.52261   1.424525         .328                   .004
Valid N (listwise)   326369

Figure 4 Biology scores (2021)
Figure 5 Physics scores (2021)

3 Reasons for choosing the dataset
The primary function of the sampling distribution is statistical inference. This inference works in two ways:
- Knowing the population mean and standard deviation (assuming the population is not extremely non-normal) enables us to calculate a probability statement about the sample mean.
- Knowing the sample mean, standard deviation, and size enables us to calculate a probability statement about the population mean.
By selecting the Biology scores of the university entrance test (2021), all values of the population and of any sample are available, enabling our team to create different scenarios to apply the following statistical techniques and theories: the Central Limit Theorem and the confidence interval estimator of μ. In addition, to better illustrate the Central Limit Theorem, our team also uses the Physics scores in the dataset.

II THEORY APPLICATION
1 Central Limit Theorem
Samples of two sizes (n = 40 and n = 100) were each randomly drawn 100 times from the Physics scores to produce sampling distribution diagrams illustrating the Central Limit Theorem.

Figure 6 Sampling distribution of $\bar{X}$ of Physics scores [n = 40, 100 samples]
Figure 7 Sampling distribution of $\bar{X}$ of Physics scores [n = 100, 100 samples]

As the Physics score is slightly negatively skewed, the larger the value of n, the more normally $\bar{X}$ will be distributed [1]. This reaffirms the Central Limit Theorem.
[1] Skewness: -.141 (n = 40); -.022 (n = 100)

As for the Biology scores, samples of three sizes (n = 40, n = 100, and n = 300) were each randomly drawn 100 times to produce sampling distribution diagrams:

Figure 8 Sampling distribution of $\bar{X}$ of Biology scores [n = 40, 100 samples]
Figure 9 Sampling distribution of $\bar{X}$ of Biology scores [n = 100, 100 samples]
Figure 10 Sampling distribution of $\bar{X}$ of Biology scores [n = 300, 100 samples]

Since the Biology score is normally distributed, $\bar{X}$ of this population is normally distributed for all values of n [2]. This is again consistent with the Central Limit Theorem.

2 Using sampling distribution for Inference
From the dataset, important information about the population has been made available:
- The Biology scores are normally distributed
- The average score (population mean): $\mu = 5.523$
- The standard deviation: $\sigma = 1.425$

2.1 Inference about the sample mean
Scenario
The average Biology score on the university test
(2021) is 5.523, with a standard deviation of 1.425. If I randomly choose a class of 40 students, what is the probability that their average score is higher than 5.00?
From the central limit theorem, we know the following:
- $\bar{X}$ is normally distributed
- $\mu_{\bar{x}} = \mu = 5.523$
- $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 1.425/\sqrt{40} = 0.225$
Hence,
$P(\bar{X} > 5.00) = P\left(\frac{\bar{X} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} > \frac{5.00 - 5.523}{0.225}\right) = P(Z > -2.32) = 1 - P(Z < -2.32) = 1 - 0.0102 = 0.9898$
Interpretation: The probability that the average score of a 40-student class is higher than 5.00, when the average Biology score is 5.523, is very high.
[2] Skewness: -.022 (n = 40); -.084 (n = 100); .402 (n = 300)

2.2 Estimation: Inference about the population mean
Scenario
From previous years, we know that the Biology score is a normally distributed random variable with unknown mean μ and standard deviation $\sigma = 1.425$. A sample of 40 Biology test scores is taken from the population. The sample average score is 5.83, and the standard deviation of the sample is 1.371. Determine the 95% confidence interval for the population average score.

Figure 11 Biology scores of 40 random students

Because the population standard deviation is known in this scenario ($\sigma = 1.425$), it is used in the formula in place of the sample standard deviation:
$\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n} = 5.83 \pm 1.96 \times 1.425/\sqrt{40} = 5.83 \pm 0.442$, so LCL = 5.388 and UCL = 6.272.
Interpretation: In repeated sampling, 95% of the intervals created this way would contain the true average Biology score.

Scenario
Now assume that we don't have any information about the Biology score of the population. A sample of 40 Biology test scores is taken from the population. The sample average score is 5.83, and the standard deviation of the sample is 1.371. Determine the 95% confidence interval for the population average score.
Since $\sigma$ is unknown, the sample standard deviation and the t-distribution with $n - 1 = 39$ degrees of freedom are used:
$\bar{x} \pm t_{\alpha/2,\,n-1}\,s/\sqrt{n} = 5.83 \pm 2.023 \times 1.371/\sqrt{40} = 5.83 \pm 0.438$, so LCL = 5.392 and UCL = 6.268.
Interpretation: In repeated sampling, 95% of the intervals created this way would contain the true average Biology score.

2.3 Determination of sample size
Scenario
Assume that a student has just finished her/his Biology test and wants to know the average score of this test. Based on her/his research from previous years, the standard deviation of the Biology score is around 1.425. How many friends does she/he need to ask to be 99% confident that her/his sample average score will be within 0.5 of the true average score?
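
As a numerical cross-check on the hand calculation for this scenario, the required sample size can be computed directly. The sketch below is only an illustration, not part of the original analysis: it assumes the known-σ formula $n = (z_{\alpha/2}\,\sigma / B)^2$, where $B$ is the allowed error, and the function name `required_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma, margin, confidence):
    """Return (n rounded up, unrounded n) so that the sample mean is within
    `margin` of the population mean with the given confidence level.

    Assumes the population standard deviation `sigma` is known, so the
    z-based formula n = (z_{alpha/2} * sigma / margin)^2 applies.
    """
    alpha = 1.0 - confidence
    # Two-sided critical value z_{alpha/2} of the standard normal distribution
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    n_exact = (z * sigma / margin) ** 2
    # A sample size must be a whole number, so always round up
    return math.ceil(n_exact), n_exact


# Values taken from the scenario: sigma = 1.425, error bound 0.5, 99% confidence
n_needed, n_exact = required_sample_size(sigma=1.425, margin=0.5, confidence=0.99)
print(n_needed, round(n_exact, 3))
```

With the exact normal quantile (about 2.576) the unrounded value comes out near 53.89, slightly above the figure obtained from the rounded table value 2.575; both round up to 54 friends.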
Step 1: Set up the equation
$P(|\bar{X} - \mu| \le 0.5) = 0.99$
Step 2: Standardize
$P\left(\frac{-0.5}{\sigma/\sqrt{n}} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le \frac{0.5}{\sigma/\sqrt{n}}\right) = 0.99$, that is, $P\left(\frac{-0.5}{1.425/\sqrt{n}} \le Z \le \frac{0.5}{1.425/\sqrt{n}}\right) = 0.99$
Step 3: Solve for n
Since $P(-2.575 \le Z \le 2.575) = 0.99$, we need $\frac{0.5}{1.425/\sqrt{n}} = 2.575$, so $\sqrt{n} = \frac{2.575 \times 1.425}{0.5} = 7.339$ and $n = 53.857$
Interpretation: The student needs to ask at least 54 friends to estimate the true average score to within 0.5 with 99% confidence.

These scenarios address two different research questions:
- determining a 95% confidence interval for the population average score
- determining how large the sample must be to reach a given precision at 99% confidence

CONCLUSION
In conclusion, sampling distributions and estimation are essential for inferential statistics. A sampling distribution is the probability distribution of a statistic computed from samples drawn from a larger population. Its primary purpose is to show how well results from small samples represent the larger population. Researchers use sampling distributions to make estimates and inferences about larger populations of interest based on the data that they have access to. Estimation provides estimates of a population parameter with a desired degree of confidence. Estimation allows researchers to make precise inferences and to gain a stronger understanding of a population in general. Many fields, including finance, medicine, social sciences, and engineering, use sampling distributions and estimation. These powerful tools can significantly benefit our lives.

REFERENCES
Elizabeth A. McGlynn, Ph.D., Steven M. Asch, M.D., M.P.H., John Adams, Ph.D., Joan Keesey, B.A., Jennifer Hicks, M.P.H., Ph.D., Alison DeCristofaro, M.P.H., and Eve A. Kerr, M.D., M.P.H. (2003). The Quality of Health Care Delivered to Adults in the United States. The New England Journal of Medicine. Retrieved June 26, 2003, from https://www.nejm.org/doi/full/10.1056/nejmsa022615?fbclid=IwAR2Zi4XNxG0WyLCo3Wv_ZvddZutTKg9jcSuyFEFsEIlXQzf9z5wDrhxMJ0

Keller, G. (2017). Chapters 6, 7, 8. In Statistics for Management and Economics (11th ed., pp. 286-332). Cengage.

Leslie B. Snyder, PhD; Frances Fleming Milici, PhD; Michael Slater, PhD; Helen Sun, MA; Yuliya Strizhakova, PhD (2006). Effects of Alcohol Advertising Exposure on Drinking Among Youth. JAMA Network. Retrieved December 10, 2015, from https://jamanetwork.com/journals/jamapediatrics/fullarticle/204410

Xavier Pi-Sunyer, M.D., Arne Astrup, M.D., D.M.Sc., Ken Fujioka, M.D., Frank Greenway, M.D., Alfredo Halpern, M.D., Michel Krempf, M.D., Ph.D., David C.W. Lau, M.D., Ph.D., Carel W. le Roux, F.R.C.P., Ph.D., Rafael Violante Ortiz, M.D., Christine Bjørn Jensen, M.D., Ph.D., and John P.H. Wilding (2015). A Randomized, Controlled Trial of 3.0 mg of Liraglutide in Weight Management. The New England Journal of Medicine. Retrieved July 2, 2015, from https://www.nejm.org/doi/full/10.1056/nejmoa1411892?fbclid=IwAR2b_4aGUNHUQ19eVfLaIFWHjMxVFDtFyCiFNeERUApAwsIBpWua0ufhFk