Do Statistics Lie?
“I don’t trust any statistics I haven’t falsified myself.”
“Statistics can be made to prove anything.”
One often hears statements such as these when challenging the figures used by an opponent. Benjamin Disraeli, for example, is famously reputed to have declared that there are three kinds of lies: lies, damned lies, and statistics.
This saying highlights the potential for statistics to mislead. Many people who are skeptical of statistics find their doubts validated when two conflicting statistical analyses of the same situation yield opposite results. This raises an important question: if statistics can be so easily manipulated to bolster biased viewpoints, what value do they truly hold?
Despite the often casual reception of disparaging quotes, statistics are crucial for supporting argumentative claims. Daily newspapers feature various tables, diagrams, and figures, highlighting the importance of data in public discourse. Each month brings significant attention to new economic forecasts, survey results, and consumer confidence metrics. Additionally, countless investors depend on financial analysts' market forecasts to guide their investment choices.
Statistics can evoke skepticism in certain situations while simultaneously exuding authority in others, leading to a paradox. Despite Disraeli's assertion that "there are three kinds of lies: lies, damned lies, and statistics," individuals and organizations continue to depend on statistics for decision-making. Swoboda (1971) identifies two key reasons for this ambivalence towards statistical methods:
• First, there is a lack of knowledge concerning the role, methods, and limits of statistics.
• Second, many figures which are regarded as statistics are in fact pseudo-statistics.
Since the 1970s, the rise of computer technology has made statistical analysis more accessible, allowing individuals with basic arithmetic skills to produce statistical tables and graphics using readily available software. However, this accessibility often leads to violations of fundamental methodological principles, resulting in incomplete or misleading information. Both journalists and readers frequently misinterpret or misrepresent statistics, and this issue extends to scientific literature, where pseudo-statistics – data derived from incorrect methods or fabricated entirely – are not uncommon. As Krämer (2005) notes, statistics can be manipulated either intentionally or through improper selection, highlighting the dual nature of statistics as both a tool for understanding and a potential source of misinformation.
This book aims to address the prevalence of misleading statistics and erroneous interpretations that often distort our understanding of data. With examples of improper calculations, suggestive questioning, and distorted samples, we emphasize the importance of mastering quantitative methods in a data-driven world. As Goethe noted, "the numbers instruct us," highlighting the critical role of statistical models in microeconomic analysis and business decision-making. Our goal is not only to present key statistical methods and their applications but also to enhance the reader's ability to identify errors and manipulation in data.
Many people believe that common sense is enough for understanding statistics, but those with formal statistical training know this isn't true. Textbooks inevitably incorporate formulas because qualitative descriptions have limited utility. For instance, when students inquire about the failure rate on a statistics test, they expect a precise quantitative answer, like 10%, which necessitates calculations and the use of formulas.
This book includes a formal presentation of mathematical methods that should not be overlooked. However, readers who possess a solid understanding of basic analytical principles will find the material accessible and comprehensible.
Two Types of Statistics
What are the characteristics of statistical methods that avoid sources of error or attempts to manipulate? To answer this question, we first need to understand the purpose of statistics.
Statistical methods have been utilized since ancient times, with evidence dating back to the 6th century BC, when Servius Tullius mandated regular censuses of citizens. A well-known historical reference is found in the Bible, where it is stated that "Caesar Augustus issued a decree for a census of the entire Roman world," marking the first census during Quirinius's governorship in Syria and prompting everyone to return to their hometowns for registration (Luke 2:1–5).
Throughout history, politicians have sought to evaluate the wealth of their citizens, primarily for taxation rather than altruistic motives. This pursuit involved collecting data about the population, enabling the ruling class to gain insights into the lands they governed. Such efforts to compile information about a nation exemplify the foundational principles of statistics.
In the early days of statistical record keeping, comprehensive surveys aimed to count every individual, animal, and object. By the early 20th century, tracking employment became a significant focus, but measuring unemployment proved challenging due to the vast numbers involved. This period marked the emergence of descriptive statistics as a distinct field.
Descriptive statistics encompasses various techniques aimed at summarizing and describing data from a population. This includes calculating figures and parameters, as well as creating graphs and tables to effectively present the information.
The development of inductive data analysis, which allows for drawing conclusions about a total population from a sample, emerged in the early 20th century, with key contributors including Jacob Bernoulli, Abraham de Moivre, Thomas Bayes, Pierre-Simon Laplace, Carl Friedrich Gauss, Pafnuti Chebyshev, Francis Galton, Ronald A. Fisher, and William Sealy Gosset. Their pioneering work enabled the use of various inductive techniques, eliminating the need to count and measure every individual in a population and facilitating more manageable surveys.
In 6/7 A.D., Judea, along with Edom and Samaria, was established as a Roman protectorate, likely linked to the census mandated by Quirinius, which aimed to register all residents and their properties for tax purposes. Alternatively, this passage may also reference an earlier census conducted in 8/7 B.C.
In many situations, such as product design or election research, it is impractical and prohibitively expensive for firms or researchers to survey an entire population. Instead, they opt to gather insights from a representative sample of potential customers or voters, which provides valuable information without the need for a complete survey. This approach allows for effective analysis while minimizing costs and logistical challenges.
When assessing data collected in this way, one must keep in mind that the insights are based on a sample rather than a complete survey. Consequently, the conclusions drawn carry a defined level of uncertainty, which can be quantified statistically. This uncertainty is the trade-off inherent in the simplified methodology of inductive statistics.
Descriptive and inductive statistics are essential scientific disciplines utilized across business, economics, the natural sciences, and the social sciences. They involve methods for analyzing and describing mass phenomena through numerical data. The primary objective is to derive conclusions about the characteristics of the subjects studied, whether through comprehensive surveys or partial samples. Statistics provide a framework for making informed decisions amidst uncertainty, establishing them as a fundamental component of decision theory.
Statistics serve two primary purposes: descriptive statistics summarize data meaningfully, converting it into actionable information. This information, when analyzed through inductive statistical methods, leads to the generation of generalizable knowledge that can guide political and strategic decisions. The relationship between data, information, and knowledge is illustrated in Fig. 1.1.
The Generation of Knowledge Through Statistics
Statistics play a crucial role in the pursuit of new knowledge, as the process of knowledge generation in both science and professional practice often includes essential descriptive and inductive steps. This significance can be illustrated through a clear example.
A market researcher in dentistry aims to explore the correlation between the price and sales volume of a specific toothpaste brand. To achieve this, the researcher begins by collecting detailed market data and insights, which will help in understanding consumer behavior and preferences.
Analyzing weekly toothpaste prices and sales over the past three years reveals that higher prices lead to decreased sales as consumers shift to alternative brands, while lower prices result in increased sales. This observed relationship, derived from descriptive statistics, aligns with the microeconomic principles of price and demand. Although complete sales data may not be available for all stores, the insights gained from partial samples validate or challenge existing economic theories, highlighting the importance of descriptive statistics in understanding consumer behavior.
At this stage, the researcher must evaluate whether the insights gained from the partial sample are representative of the entire population, acknowledging that generalizable information in descriptive statistics is initially speculative. By employing inductive statistical techniques, the researcher can estimate the error probability linked to applying these insights to the overall population. Ultimately, the researcher must determine the acceptable level of error probability that would render the insights qualified and applicable to the broader context.
Note: The figure shows the average weekly prices and associated sales volumes over a 3-year period. Each point represents the number of units sold at a certain price within a given week.
Fig 1.2 Price and demand function for sensitive toothpaste
Even with complete sales data from all stores, it remains essential to question whether the established relationship between price and sales will persist in the future. Since future data is unavailable, we must rely on past information to make forecasts. This forecasting process is crucial for testing theories, assumptions, and expectations, ultimately transforming information into generalizable knowledge for the firm.
Descriptive and inductive statistics serve distinct yet essential roles in the research process, making it beneficial to examine each area individually and highlight their differences and similarities. In academic settings, university statistics courses often cover these two domains in separate lectures to facilitate a clearer understanding.
The Phases of Empirical Research
From Exploration to Theory
The development of a theory is essential for advancing knowledge, despite practitioners' reluctance to use the term for fear of appearing overly academic. Originating from the Greek word "theorema," which means to view or investigate, a theory represents a speculative description of relationships within a system. It relies on observing and linking individual events, and requires verification to be deemed generally applicable. An empirical theory connects these events to deduce the origins of specific conditions, establishing a unified terminology for understanding cause-and-effect relationships. For instance, in analyzing toothpaste sales, researchers must identify key factors influencing sales, such as product pricing, competitor pricing, advertising efforts, and target demographics.
In quantitative studies, effective communication is crucial, as feedback loops for self- or third-person verification play a significant role in both the Problem Definition and Theory Phases. Engaging with outside experts, such as product managers, is essential to uncover hidden events and influences that may impact the study. This collaborative approach extends to other departments; for instance, purchasing agents should be consulted for procurement processes, while engineers are key contacts for R&D projects. By gathering diverse perspectives, researchers not only enhance their understanding of causes and effects but also avoid the potential embarrassment of overlooking critical influencing factors in their findings.
From Theories to Models
Once the theoretical interrelationships governing a specific situation are established, the construction of a model can commence. While the terms "theory" and "model" are often used interchangeably, it is important to note that "theory" specifically refers to a language-based description of reality. If mathematical expressions are considered a form of language with its own grammar and semiotics, then a theory can also be formed on the basis of mathematics.
Fig 1.3 The phases of empirical research: problem definition (discussions with decision makers and interviews with experts, a first screening of data and information sources, a common understanding of the problem and potential interrelationships), theory formulation (research questions and hypotheses; an analytical, verbal, graphical, or mathematical model), research design (measurement and scaling procedures, construction and pretesting of a questionnaire, sampling process and sample size, plan for data analysis), field work & assessment, and decision – a process that should be characterized by communication, cooperation, confidence, candor, closeness, continuity, and creativity
In professional practice, however, one tends to use the term model in this context – a model is merely a theory applied to a specific set of circumstances. Models are a technique by which various theoretical considerations are combined in order to render an approximate description of reality (Fig. 1.4). An attempt is made to take a specific real-world problem and, through abstraction and simplification, to represent it formally in the form of a structurally cohesive model. The model is structured to reflect the totality of the traits and relationships that characterize a specific subset of reality. Thanks to models, the problem of mastering the complexity that surrounds economic activity initially seems to be solved: it would appear that in order to reach rational decisions that ensure the prosperity of a firm or the economy as a whole, one merely has to assemble data related to a specific subject of study, evaluate these data statistically, and then disseminate one's findings. In actual practice, however, one quickly comes to the realization that the task of providing a comprehensive description of economic reality is hardly possible, and that the decision-making process is an inherently messy one. The myriad aspects and interrelationships of economic reality are far too complex to be comprehensively mapped. The mapping of reality can never be undertaken in a manner that is structurally homogenous – or, as one also says, isomorphic. No model can fulfil this task. Consequently, models are almost invariably reductionist, or homomorphic.
The accuracy of a model in representing reality is inherently limited, often constrained by practical considerations. A model should maintain a balance, avoiding excessive complexity that renders it unmanageable while still capturing the essential properties and relationships relevant to the problem it aims to analyze. Essentially, models are mental constructs composed of abstractions, enabling us to depict complex situations and processes that are not directly observable. Ultimately, a model is merely an approximation of reality.
A variety of methods can be used to create simplified representations of complex realities and to illustrate individual relationships. Among these, physical or iconic models – such as dioramas, maps, and blueprints – are particularly vivid. However, due to the abstract nature of economic relationships, depicting them through physical models poses significant challenges.
Symbolic models play a crucial role in economics, utilizing language as a system of signs governed by syntactic and semantic rules to explore and represent complex circumstances. When everyday language or specialized jargon is used, we refer to these as verbal models or theories. A verbal model consists of symbolic signs and words, but these do not inherently convey meaning. For instance, the phrase "Spotted lives in Chicago my grandma rabbit" lacks coherence, and even a syntactically correct rearrangement, like "My grandma is spotted and her rabbit lives in Chicago," may still fail to convey a logical idea. Meaning emerges in verbal models only when semantics are considered, linking the elements in a coherent manner, such as "My grandma lives in Chicago and her rabbit is spotted."
Artificial languages, including logical and mathematical systems, are also classified as symbolic models; they utilize character strings (variables) organized in a syntactically and semantically coherent manner within equations. For instance, in the context of our toothpaste example, we can develop a verbal model or theory to illustrate this concept:
• There is an inverse relationship between toothpaste sales and the price of the product, and a direct relationship between toothpaste sales and marketing expenditures during each period (i.e., calendar week).
The formal symbolic model can be represented as \( y_i = f(p_i, w_i) = \alpha_1 p_i + \alpha_2 w_i + \beta \), where \( p_i \) denotes the price at time \( i \), \( w_i \) represents marketing expenditures at time \( i \), \( \alpha_1 \) and \( \alpha_2 \) indicate the effectiveness of each variable, and \( \beta \) is a constant term.
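To make the symbolic model concrete, here is a minimal Python sketch (not part of the original text) that fits such a linear model by ordinary least squares; the weekly price, marketing, and sales figures are purely hypothetical.

```python
import numpy as np

# Hypothetical weekly data: price p_i (EUR), marketing spend w_i (EUR 1,000), sales y_i (tubes)
p = np.array([1.99, 2.09, 2.19, 1.89, 2.29, 1.99, 2.09, 2.19])
w = np.array([5.0, 4.0, 4.5, 6.0, 3.5, 5.5, 4.0, 4.5])
y = np.array([52000, 48500, 45000, 56000, 41500, 53500, 48000, 44500])

# Design matrix [p_i, w_i, 1] for the model y_i = alpha_1*p_i + alpha_2*w_i + beta
X = np.column_stack([p, w, np.ones_like(p)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_1, alpha_2, beta = coef
print(f"alpha_1 (price effect):     {alpha_1:.1f}")
print(f"alpha_2 (marketing effect): {alpha_2:.1f}")
print(f"beta (constant):            {beta:.1f}")
```

In practice the estimated coefficients would be checked against the signs postulated by the verbal model (negative price effect, positive marketing effect).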
Both models discussed are homomorphic partial models, focusing solely on the sale of a single product without considering other factors such as employee headcount. This limited scope contrasts with the comprehensive analysis expected from total models. However, creating total models is often labor-intensive and costly, which is why they are primarily developed by economic research institutes.
Stochastic, homomorphic, and partial models are essential in statistics, and often challenging for business and economics students. The term "stochastic" refers to stochastic analysis, a form of inductive statistics focused on assessing non-deterministic systems. This analysis involves understanding chance and randomness, particularly when the causes behind certain events are unknown, highlighting the non-deterministic nature of these occurrences.
Whether we are dealing with future events or a population that we have surveyed with a sample, it is simply impossible to make forecasts without some degree of uncertainty. Only the past is certain. The poor chap in Fig. 1.5 demonstrates how certainty can be understood differently in everyday contexts.
Economists often struggle with the inherent uncertainty of life, which they must accept rather than eliminate. To navigate this uncertainty, they employ inductive statistics and stochastic analysis to estimate the likelihood of events occurring. For instance, a young man would likely feel little reassurance if his companion claimed a 95% chance of returning the next day, highlighting that everyday language – a simple "yes" or "no" – often involves conjecture regarding future events. Nevertheless, statistics should not be criticized for its uncertain predictions, as it seeks to quantify both certainty and uncertainty, acknowledging the randomness and unpredictability that are part of daily life.
Another important aspect of a model is its purpose. In this regard, we can differentiate between the following model types:
• Explanatory models or forecasting models
• Decision models or optimization models
The question asked and its complexity ultimately determine the purpose a model must fulfil.
Fig 1.5 What is certain? ("So, I'll see you tomorrow?") (Source: Swoboda 1971, p 31)
Descriptive models aim to represent reality through structured frameworks without formulating general hypotheses about causal relationships in actual systems. For instance, a profit and loss statement serves as a model to illustrate a company's financial status, yet it does not explore or depict the causal connections between the various elements within the statement.
Explanatory models aim to formalize theoretical assumptions regarding causal relationships and validate these assumptions through empirical data analysis. By employing an explanatory model, researchers can identify interconnections among various factors related to firms and make future projections based on these insights. When specifically focused on predicting future outcomes, these models are referred to as forecasting models, which are considered a subset of explanatory models.
To return to our toothpaste example: the determination that a price reduction of €0.10 leads to a sales increase of 10,000 tubes of toothpaste is based on an explanatory model. If, by contrast, we anticipate that a €0.10 price hike this week will result in decreased sales next week, we are utilizing a forecasting model to project future outcomes.
Decision models, or optimization models, are defined by Grochla (1969) as systems of equations designed to provide actionable recommendations. These models focus on achieving optimal decisions through a mathematical target function that users aim to optimize while meeting specific constraints. Predominantly utilized in Operations Research, decision models are less frequently applied in statistical data analysis (Runzheimer et al., 2005). In contrast, simulation models aim to replicate processes, such as production phases, using random-number generator functions in statistical software to reveal interdependencies between processes and stochastic factors like production rate variance. Additionally, simulations can also be found in role-playing exercises during leadership seminars or Family Constellation sessions.
From Models to Business Intelligence
Statistical methods provide valuable insights into complex situations, although mastering all analytical techniques requires skill and experience. For instance, a professor might enthusiastically explain the Heckman selection model to business professionals, but many listeners quickly feel lost and uncertain about their understanding, only to realize that the rest of the audience is equally confused. The audience slowly loses interest, and minds wander. After the talk is over, the professor is thanked for his illuminating presentation. And those in attendance never end up using the method that was presented.
Effective presenters recognize the importance of avoiding excessive technical jargon, ensuring that their findings are communicated clearly to a general audience. The primary goal of data analysis is not merely the analysis itself, but the effective communication of results in a way that resonates with decision-makers. Only when findings are understood and accepted can they influence decisions and shape future outcomes. Therefore, analytical processes should be conducted with a clear focus on the informational needs of management, even if those needs are not explicitly defined.
The communication of findings is a crucial aspect of any analytical project, representing the final phase of the study. The intelligence cycle, as illustrated in Fig. 1.6, outlines the systematic processes involved in constructing and implementing a decision model. It encompasses the acquisition, gathering, transmission, evaluation, and analysis of raw information, ultimately producing finished intelligence for policymakers (Kunze 2000, p 70). Essentially, the intelligence cycle serves as an analytical framework that converts disaggregated data into actionable strategic knowledge (Bernhardt 1994, p 12).
In the next chapter, we will focus on the activities involved in the assessment phase, where raw data is collected and converted into strategically relevant information using descriptive assessment methods, as illustrated in the intelligence cycle.
Fig 1.6 The intelligence cycle (Source: Own graphic, adapted from Harkleroad 1996, p 45)
Data Collection
The intelligence cycle begins with data collection, where businesses often gather vital information on expenditures and sales. However, many fail to consolidate this data into a central database for systematic evaluation. The statistician's primary role is to extract this valuable information, which often necessitates persuasive skills, as employees may be reluctant to share data that could expose previous shortcomings.
Before analyzing systematically collected data, firms must prepare by addressing key questions about authorization, skill sets, and time availability. Businesses often grapple with these challenges, especially when managing extensive datasets from customer loyalty programs. The administrative workload can overwhelm an entire department, delaying the systematic evaluation of the data.
Businesses can enhance their data collection by utilizing public databases, which are often available for free from research institutes, government statistics offices, and international organizations like Eurostat, the OECD, and the World Bank. Additionally, private marketing research firms such as ACNielsen and the GfK Group offer databases for a fee. These resources can provide valuable insights to inform business decisions. For a comprehensive list of useful data sources, refer to Table 2.1.
Public data plays a crucial role in enhancing business decisions, particularly for a procurement department in a company that manufactures intermediate goods for machine construction. By leveraging this data, businesses can effectively reduce costs, optimize inventory levels, and refine their overall procurement strategies.
The procurement department is responsible for forecasting stochastic demand for materials and operational supplies, often relying on the Ifo Business Climate Index rather than sales department projections, which tend to be overly optimistic. By analyzing this index, the procurement team can create an accurate forecast for the end-user industry over the next six months. If the index indicates a downward trend in the end-user industry, it suggests a potential decline in sales for the manufacturing company, allowing the procurement department to make informed ordering decisions based on reliable public data.
Public data often exists at various levels of aggregation, typically representing groups rather than individual entities. For instance, the Centre for European Economic Research (ZEW) conducts surveys on industry innovation that report collective data, such as R&D expenditures among chemical companies with 20 to 49 employees, enabling individual firms to benchmark their performance. Similarly, the GfK household panel tracks the purchasing behavior of households as a whole rather than individuals, while loyalty card data offers aggregate insights that cannot reliably be traced back to specific cardholders, reflecting household activity instead of individual purchases.
Table 2.1 External data sources at international institutions
• German Federal Statistical Office – destatis.de – offers links to diverse international databases
• Eurostat – epp.eurostat.ec.europa.eu
• OECD – oecd.org – various databases
• World Bank – worldbank.org – world and country-specific development indicators
• UN – un.org – diverse databases
• ILO – ilo.org – labour statistics and databases
• IMF – imf.org – global economic indicators, financial statistics, information on direct investment, etc.
The Ifo Business Climate Index, published monthly by Germany's Ifo Institute, is derived from a survey of approximately 7,000 companies across the manufacturing, construction, wholesaling, and retail sectors. This comprehensive survey assesses various factors including the current business climate, domestic production, inventory levels, demand, domestic pricing, changes in orders from the previous month, foreign orders, export activity, employment trends, and outlooks for prices and business conditions over the next three to six months.
2 For more, see the method described in Chap 5.
Surveys are essential for gathering information about individuals or businesses, although they can be the most costly method of data collection. They enable companies to tailor their own questions and can be conducted orally or in writing. While questionnaires remain the traditional format, telephone and online surveys are gaining popularity.
Level of Measurement
This textbook does not cover all the rules for constructing questionnaires; for a comprehensive understanding, readers should refer to additional resources, such as Malhotra (2010). Here, we will concentrate on the key criteria for selecting a specific quantitative assessment method.
To assess customer preferences in a small-town grocery store, you decide to conduct a survey after receiving multiple requests to expand your selection of butter and margarine. Due to limited display and storage space, it's crucial to determine if these requests reflect the broader preferences of your clientele. Therefore, you enlist a group of students to administer a short questionnaire to gather valuable insights.
In just one week, the students gathered questionnaires from 850 customers, each serving as a statistical unit characterized by specific traits such as sex, age, body weight, preferred bread spread, and selection rating. For instance, a customer named Mr Smith is identified as male, 67 years old, weighing 74 kg, preferring margarine, and rating the selection as fair. Each survey necessitates defining the statistical unit, the relevant traits, and the possible trait values. Variables are classified as discrete, which can only take specific values (such as family size), or continuous, which can assume any value within a range (such as weight or height). Overall, the statistical units represent the subjects of the survey, differing in their trait values; gender, selection rating, and age illustrate the nominal, ordinal, and cardinal scales of measurement used in quantitative analysis.
The nominal scale represents the lowest level of measurement, where numbers are assigned to different traits, such as 1 for male and 2 for female. This type of variable is often called a qualitative variable or attribute. The assigned values categorize each statistical unit into specific groups, like male or female respondents, allowing for differentiation between them. Each statistical unit can belong to only one group, and all units sharing the same trait receive the same number. Importantly, these numbers solely indicate group membership and do not convey any qualitative information.
At the nominal level, the assigned numbers permit no comparisons beyond equality or inequality – that is, membership or non-membership in a group (\( x_i = x_j \) versus \( x_i \neq x_j \)). Comparisons such as larger/smaller, less/more, or better/worse are not possible. A 1 for male is no better or worse than a 2 for female; the data are merely segmented into male and female respondents. Rank plays no role in nominal traits such as profession (e.g. butcher, baker, chimney sweep), nationality, or class year.
Fig 2.1 Retail questionnaire
The ordinal scale is the next highest level of measurement, where numbers represent ranked traits rather than mere categories. This scale often utilizes a range from 1 to x, as seen in the selection ratings in the survey. It enables researchers to compare the intensity of traits among statistical units. For instance, if Ms Peters and Ms Miller both select the third option on the selection rating, it indicates they share a similar perception of the store's selection. Conversely, if Mr Martin chooses the fourth box, it signifies that he views the selection more favorably than Ms Peters and Ms Miller. The ordinal scale allows for ordering traits, resulting in comparisons such as larger/smaller, less/more, and better/worse.
The exact distance between the third and fourth boxes remains unknown, and we cannot assume that the gap between the first and second boxes is equivalent to that between other neighboring boxes. A practical illustration of this concept is seen in athletic competition standings, where the difference in placement does not necessarily reflect a proportional difference in performance. For instance, in a swimming race, the time difference between first and second place may be just one one-thousandth of a second, while third place could be two seconds behind, despite only one position separating each competitor.
The metric or cardinal scale represents the highest level of measurement, incorporating the ordinal scale's comparative information – such as larger/smaller and better/worse – while also providing precise distances between the value traits of statistical units. For instance, in the context of age, a 20-year-old is not only older than a 15-year-old but also has a measurable age difference of five years.
An 18-year-old is two years younger than a 20-year-old, and the age gap has the same meaning across different stages of life. For instance, the difference between a 20-year-old and a 30-year-old is the same as that between an 80-year-old and a 90-year-old. This illustrates the principle of cardinal scales, where graduations are always equidistant. Other common examples of cardinal scales include measurements of currency, weight, length, and speed.
Cardinal scales are categorized into absolute, ratio, and interval scales, but these distinctions are often academic and do not significantly influence the choice of statistical methods. In contrast, the difference between cardinal and ordinal scale variables is crucial. Because a much wider range of analytical techniques is available for cardinal scales than for ordinal scales, researchers frequently perceive ordinal variables as having cardinal characteristics.
3 A metric scale with a natural zero point and a natural unit (e.g. age).
4 A metric scale with a natural zero point but without a natural unit (e.g. surface area).
5 A metric scale without a natural zero point and without a natural unit (e.g. geographical longitude).
Researchers often assume that the gradations on a five-point rating scale are identical, a common practice in empirical studies. While some acknowledge the assumption of equidistance, others provide justifications for it. Schmidt and Opp (1976) suggest that ordinal scaled variables can be treated as cardinal if there are more than four possible outcomes and the survey includes over 100 observations. However, interpreting a 0.5 difference between two ordinal scale averages remains challenging and can lead to confusion among researchers.
The scale of a variable is essential in selecting the appropriate statistical method. For instance, with a nominal variable such as profession, calculating a mean value is not feasible, as it consists of categories like bakers, butchers, and chimney sweeps. This book will further explore the relationship between statistical methods and different levels of measurement.
Before data analysis can begin, the collected data must be transferred from paper to a form that can be read and processed by a computer. We will continue to use the
850 questionnaires collected by the students as an example.
Scaling and Coding
The first step in conducting a survey is to define the level of measurement for each trait, as it is often impossible to raise this level after the survey has been implemented. For example, if respondents indicate their age by age group instead of in years, the variable remains on the ordinal scale, preventing the calculation of an average age later. To avoid such limitations, it is advisable to establish the highest possible level of measurement in advance, such as asking for age in years or specific expenditures for consumer goods.
When commissioning a survey, clients may request that questions maintain a lower level of measurement to protect respondent anonymity. This is often the case when a company's works council participates in the survey process. Researchers typically have a responsibility to honor these requests to ensure confidentiality.
In our above sample survey the following levels of measurement were used:
To effectively communicate this information to a computer, statistics applications feature Excel-like spreadsheets for direct data entry. Unlike standard Excel columns labeled A, B, C, etc., professional spreadsheets use variable names for columns, which are typically limited to eight characters. For example, the variable "selection rating" is abbreviated as "selectio".
For clarity's sake, a variable name can be linked to a longer variable label or to an entire survey question. The software commands use the variable names – e.g.
“Compute graphic for the variable selectio” – while the printout of the results displays the complete label.
To input survey results into a spreadsheet, start by entering the answers from questionnaire #1 in the first row and those from questionnaire #2 in the second row. Since computers can only process numbers, cardinal scale variables are straightforward, as they consist entirely of numerical values. For example, if respondent #1 is 31 years old and weighs 63 kg, you would enter 31 and 63 in the corresponding row. However, nominal and ordinal variables require coding with numbers; for instance, in the sample dataset, the nominal traits "male" and "female" are assigned the codes "0" and "1," respectively. These assignments are documented in a label book, enabling the accurate entry of the remaining results.
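As an illustrative sketch of this coding step (using Python/pandas rather than the SPSS or Stata data editors mentioned in the text; every respondent except #1 is hypothetical), the coded entries and a small "label book" might look like this:

```python
import pandas as pd

# Coded responses as they would appear in the data editor:
# gender: 0 = male, 1 = female; selectio: 1 = poor ... 5 = excellent
df = pd.DataFrame({
    "gender":   [0, 1, 1, 0],
    "age":      [31, 42, 67, 25],      # cardinal: entered as-is
    "weight":   [63, 58, 74, 81],      # cardinal: entered as-is
    "selectio": [2, 3, 2, 4],          # ordinal: coded 1-5
})

# The "label book": mapping of numeric codes to trait labels
labels = {
    "gender":   {0: "male", 1: "female"},
    "selectio": {1: "poor", 2: "fair", 3: "average", 4: "good", 5: "excellent"},
}

# Attach labels for readable output while keeping the numeric codes for analysis
print(df.assign(gender=df["gender"].map(labels["gender"]),
                selectio=df["selectio"].map(labels["selectio"])))
```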
Missing Values
When analyzing survey data, a significant issue is the frequent omission of answers and the prevalence of responses indicating a lack of opinion, such as "I don't know." This phenomenon can arise from various factors, including deliberate refusal to answer, missing information, the respondent's inability to provide a response, or indecision.
Faulkenberry and Mason (1978, p 533) distinguish between two main types of answer omissions:
(a) No opinion: respondents are indecisive about an answer (due to an ambiguous question, say).
(b) Non-opinion: respondents have no opinion about a topic.
The study reveals that respondents who often provide a "no opinion" omission are generally more reflective and better educated compared to those who give a "non-opinion" omission. Additionally, factors such as gender, age, and ethnic background significantly affect the likelihood of respondents omitting an answer.
The omission of answers can lead to systematic bias, with studies indicating that the share of respondents expressing no opinion can be up to 30% higher when they are given the "I don't know" option. Eliminating this option can also result in biased results, as respondents who typically choose "I don't know" may provide random or no answers when it is not available. This can transform an identifiable error into an undiscovered systematic error, making it crucial to approach answer omissions strategically during data analysis rather than simply eliminating the "I don't know" option.
Omissions of answers should not lead to misinterpretations in data analysis, which is why certain methods do not allow for missing values. The presence of these missing values can require the exclusion of additional data. In regression or factor analysis, for instance, if a respondent has missing values, all of their other responses must also be omitted. To avoid significant information loss due to frequent answer omissions, employing a substitution method is the most effective alternative. There are five general approaches to address this issue:
To effectively address missing values, the most thorough approach is to fill them in manually by conducting additional research for accurate information. Often, gaps in data such as revenue and R&D expenditures can be resolved by meticulously analyzing financial reports and other publicly available documents.
To handle missing values in qualitative (nominal) variables, a new category can be established. For instance, in a survey where respondents indicate their customer status, those who do not select any option can be classified as "customer status unknown." This approach ensures that all responses are accounted for, preventing data loss.
In frequency tables, missing values are displayed in a distinct line. Even when employing advanced methods like regression analysis, it is often feasible to interpret missing values to a certain degree. This topic will be revisited in subsequent chapters.
If additional research or the creation of a new category is not feasible, one can substitute the missing values with the arithmetic mean of the existing values, an approach applicable only to cardinal scales. Moreover, for missing cardinal values it is often more accurate to replace them with the arithmetic mean of a specific subgroup, such as the mean of students within the same course of study, rather than using the overall mean of the entire student population.
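A minimal pandas sketch of these two substitution approaches – the overall mean and the subgroup mean – using hypothetical student expenditure data and invented column names:

```python
import pandas as pd
import numpy as np

# Hypothetical student data with missing monthly expenditures (cardinal scale)
df = pd.DataFrame({
    "course":      ["BA", "BA", "BA", "Econ", "Econ", "Econ"],
    "expenditure": [650.0, np.nan, 700.0, 820.0, 780.0, np.nan],
})

# (a) Substitute the overall arithmetic mean
df["exp_mean"] = df["expenditure"].fillna(df["expenditure"].mean())

# (b) Substitute the mean of the respondent's subgroup (course of study)
df["exp_group_mean"] = df["expenditure"].fillna(
    df.groupby("course")["expenditure"].transform("mean")
)
print(df)
```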
It is crucial to ensure that any omitted answers are truly non-systematic, as compensating for missing values in systematic cases can lead to significant distortions. When omissions occur in a non-systematic manner, missing values can be estimated with reasonable accuracy. However, it is important to avoid underrepresenting the value distribution, which could ultimately mislead the results.
Roderick et al (1995) emphasize that using mean imputation for missing data can lead to understated variances and distorted associations between variables, resulting in an inconsistent covariance matrix estimate. When the amount of missing data is substantial, more complex estimation techniques, primarily based on regression analysis, become essential. For example, if a company lacks complete information on its R&D expenditures, one can estimate the missing values by considering known factors such as company sector, size, and location. It is crucial to maintain clarity regarding the reasons behind missing data when filling in these gaps – in a telephone interview, for instance, one must distinguish between:
• Respondents who do not provide a response because they do not know the answer;
• Respondents who have an answer but do not want to communicate it; and
• Respondents who do not provide a response because the question is directed to a different age group than theirs.
In certain studies, responses may be omitted due to the design, resulting in missing values. Conversely, in other instances, values can initially be assigned but later categorized as missing by the analysis software.
Outliers and Obviously Incorrect Values
Incorrect values in standardized customer surveys can significantly skew data, much like missing values. For instance, a respondent may indicate they are unemployed but report an unrealistic income, such as €1,000,000,000. Including such a response in a survey of 500 participants could inflate the average income by €2,000,000, highlighting the importance of data accuracy in survey results.
To ensure data accuracy, it is essential to remove obviously incorrect answers from the dataset. Intentionally erroneous income figures can be addressed by marking them as missing values or by assigning estimated values using the techniques outlined in Section 2.4.
Incorrect values in business surveys can also arise unintentionally, such as when respondents mistakenly provide absolute revenue figures instead of the requested amounts in thousands of euros. This can lead to reported revenue figures that are significantly higher than reality. It is crucial to identify and correct such mistakes before proceeding with data analysis to ensure accurate results.
Unintentional data inaccuracies can also arise when businesses provide expenditure breakdowns, often leading to totals exceeding 100%. Similar discrepancies can occur with individual submissions. Additionally, some values may be accurate but still be outliers that skew the analysis – for example, a company founder retiring at nearly 80 would distort the average retirement age for employees. In such cases, it may be appropriate to exclude outliers from the analysis, particularly if the context justifies it. A common approach is to trim the dataset by removing the highest and lowest five percent of values.
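A short sketch of this five-percent trimming, assuming hypothetical retirement ages and using SciPy's trimmed mean:

```python
import numpy as np
from scipy import stats

# Hypothetical retirement ages, including one extreme value (the founder)
ages = np.array([61, 62, 63, 63, 64, 64, 65, 65, 65, 66,
                 66, 67, 67, 68, 68, 69, 70, 71, 72, 79])

print("Untrimmed mean:  ", ages.mean())
# Trimmed mean: drop the lowest and highest 5 % of observations before averaging
print("5 % trimmed mean:", stats.trim_mean(ages, proportiontocut=0.05))
```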
Chapter Exercises
For each of the following statistical units, provide traits and trait values:
(a) Patient cause of death
For each of the following traits, indicate the appropriate level of measurement:
(a) Student part-time jobs
(b) Market share of a product between 0% and 100%
(c) Students’ chosen programme of study
Use Stata, SPSS, or Excel for the questionnaire in Fig. 2.1 (p. 16) and enter the data from Fig. 3.1 (p. 24). Allow for missing values in the dataset.
First Steps in Data Analysis
After completing their survey on bread spreads, the students coded the responses from 850 participants and entered the data into a computer. They began their data assessment with univariate analysis, examining each variable individually, such as the average age of respondents. In contrast, bivariate analysis focuses on the relationship between two variables, like gender and spread preference. When analyzing relationships involving more than two variables, researchers employ multivariate analysis.
The significance of statistics becomes apparent when distilling the results of 850 responses to provide a realistic impression of surveyed attributes and their relationships. For instance, when a professor is asked about final exam results, students anticipate concise information, such as "the average score was 75 %" or "the failure rate was 29.4 %". This distilled data allows students to form an accurate assessment of overall performance – for instance, that
“an average score of 75 % is worse than the 82 % average on the last final exam”.
A single distilled piece of data – in this case, the average score – appears sufficient to sum up the performance of the entire class 1
This chapter and the next will describe methods of distilling data and their attendant problems. The above survey will be used throughout as an example.
Student assessment relies on the specific distribution of scores, as an average score of 75 % can be achieved in different ways – either all students scoring 75 %, or half scoring 50 % while the other half scores 100 %. Despite having the same average, the qualitative differences in these outcomes are significant, highlighting that the average alone is insufficient to fully represent the results.
Graphical representations and frequency tables provide an overview of the univariate distribution of nominal and ordinal variables. In the frequency table, each variable trait is displayed on its own line, intersecting with columns for absolute frequency, relative frequency (in %), valid percentage values, and cumulative percentage. The relative frequency of trait xi is denoted as f(xi). Missing values are listed separately with their percentage, and they are excluded from the valid percentage and cumulative percentage calculations. The cumulative percentage sums all rows up to and including the given row; in our example, 88.1% of respondents rated the selection as average or worse. This cumulative frequency can be represented algebraically as a distribution function, F(x):
\( F(x_p) = f(x_1) + f(x_2) + \dots + f(x_p) = \sum_{i=1}^{p} f(x_i) \)   (3.1)
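The columns of such a frequency table can be reproduced with a few lines of pandas; the ratings below are hypothetical and only meant to mirror the table structure described above:

```python
import pandas as pd
import numpy as np

# Hypothetical selection ratings (ordinal: 1 = poor ... 5 = excellent); np.nan marks a missing value
ratings = pd.Series([1, 2, 2, 3, 1, 2, 4, 5, 1, np.nan, 2, 3, 1, 2, 1])

abs_freq = ratings.value_counts().sort_index()           # absolute frequencies h(x_i), missing excluded
table = pd.DataFrame({
    "absolute":   abs_freq,
    "relative %": 100 * abs_freq / len(ratings),          # base includes missing values
    "valid %":    100 * abs_freq / ratings.notna().sum(), # base excludes missing values
})
table["cumulative %"] = table["valid %"].cumsum()         # distribution function F(x)
print(table)
print("missing:", ratings.isna().sum())
```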
Graphical representations of results can be displayed using a pie chart, a horizontal bar chart, or a vertical bar chart. While all three chart types are suitable for both nominal and ordinal variables, pie charts are predominantly utilized for nominal data.
Note: Using SPSS or Stata: The data editor can usually be set to display the codes or labels for the variables, though the numerical values are stored
Fig 3.1 Survey data entered in the data editor
2 Relative frequency f(x_i) equals the absolute frequency h(x_i) relative to all valid and invalid observations (N = N_valid + N_invalid): f(x_i) = h(x_i)/N.
3 Valid percentage g(x_i) equals the absolute frequency h(x_i) relative to all valid observations (N_valid): g(x_i) = h(x_i)/N_valid.
In a bar chart, the x-axis represents the traits – such as poor, fair, average, good, and excellent – while the y-axis displays either relative or absolute frequency. The height of each bar corresponds to the frequency of the respective x-value. When relative frequencies are plotted on the y-axis, the resulting graph illustrates the frequency function.
The distribution of an ordinally scaled variable can also be represented using the F(x) distribution function, which plots the traits of the x-variable on the x-axis and the cumulative percentages on the y-axis, creating a step function. This mirrors the cumulative percentages column of the frequency table, providing a clear visualization of the data distribution.
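As an illustrative matplotlib sketch (not from the book), the frequency function and the cumulative step function can be drawn from the relative frequencies reported for the selection variable later in the text:

```python
import matplotlib.pyplot as plt

categories = ["poor", "fair", "average", "good", "excellent"]
rel_freq = [0.46, 0.313, 0.108, 0.073, 0.046]    # relative frequencies f(x) of the selection ratings
cum_freq = [sum(rel_freq[:i + 1]) for i in range(len(rel_freq))]  # cumulative percentages F(x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(categories, rel_freq)          # frequency function (bar chart)
ax1.set_ylim(0, 0.5)                   # y-axis starts at zero (see the caveat below)
ax1.set_ylabel("relative frequency")

pos = range(len(categories))
ax2.step(pos, cum_freq, where="post")  # distribution function as a step function
ax2.set_xticks(pos)
ax2.set_xticklabels(categories)
ax2.set_ylim(0, 1)
ax2.set_ylabel("cumulative frequency")

plt.tight_layout()
plt.show()
```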
Many publications often start the y-axis of vertical bar charts at an arbitrary value rather than zero, which can cause confusion for viewers. As illustrated in Fig. 3.5, this practice can misrepresent data, even though both graphs depict the same information regarding the relative frequencies of male and female respondents, at 49% and 51%, respectively.
Fig 3.3 Bar chart/Frequency distribution for the selection variable
Fig 3.2 Frequency table for selection ratings
The first graph misleadingly suggests a ratio of five females to one male due to a truncated y-axis, which exaggerates the difference in relative frequency between genders. This distortion makes the 2 percentage point difference appear more significant than it truly is. Therefore, the second graph in Fig. 3.5 is a more accurate representation of the data.
Similar distortions can arise when two alternate forms of a pie chart are used.
The first chart in Fig. 3.6 illustrates relative frequency through the size of each wedge, with the angle of each circle segment weighted by relative frequency according to the formula \( \alpha_i = f(x_i) \cdot 360^\circ \).
To enhance readability and effectiveness, key traits in pie charts should be positioned at the 12 o'clock mark, as viewers typically read them clockwise from the top. Additionally, it is crucial to limit the number of segments in the chart to avoid confusion, and to organize the segments systematically, for example by size or content.
Fig 3.4 Distribution function for the selection variable
Fig 3.5 Different representations of the same data (1)
The second graph in Fig. 3.6, a modern "perspective" or "3D" pie chart, offers a contemporary look but misrepresents the data, as the area of each wedge no longer accurately reflects relative frequency. This can lead to misleading interpretations, as segments in the foreground appear larger while those in the back are partly obscured. Additionally, the "lifting up" of specific wedges can further exaggerate this visual distortion.
When representing cardinal variables such as bodyweight, a vertical bar diagram may not be effective due to the overwhelming number of traits and the minimal variation in bar heights. Often, a specific trait value appears only once among the observations, making it challenging to convey the essential relationships clearly. Therefore, it is advisable to group the individual values of cardinal variables into classes, as illustrated in Fig. 3.7.
In this classification, the upper limit of a class is included in that class, while the lower limit is excluded. For instance, individuals weighing 60 kg are categorized within the 50–60 kg class, whereas those weighing 50 kg fall into the next lower class. It is essential for data analysts to define class sizes and membership criteria, particularly at the boundaries, and to communicate these decisions transparently.
A histogram is a graphical representation of cardinal variables that displays relative class frequency through area rather than height. The height of the bars indicates frequency density: the denser the bar, the more observations fall within that class. As frequency density increases, so does the area of the bars, and the intervals must be chosen carefully to avoid distorting the data. The area allocated to each class in the histogram reflects its relative frequency compared to the total area of all classes.
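A sketch of classing and a density-scaled histogram in Python; the bodyweights, the 10-kg class width, and the random seed are assumptions for illustration, not data from the survey:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
weights = pd.Series(rng.normal(70, 10, size=850))   # hypothetical bodyweights in kg

# Classes of width 10 kg; right=True means the upper limit belongs to the class
# (e.g. 60 kg falls into the 50-60 kg class), matching the convention described above
classes = pd.cut(weights, bins=range(30, 121, 10), right=True)
print(classes.value_counts().sort_index())

# Histogram: with density=True the *area* of each bar equals the relative class frequency
plt.hist(weights, bins=range(30, 121, 10), density=True, edgecolor="black")
plt.xlabel("bodyweight (kg)")
plt.ylabel("frequency density")
plt.show()
```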
Fig 3.6 Different representations of the same data (2)
4 For each ith class, the following applies: \( x_i < X \leq x_{i+1} \) with i ∈ {1, 2, ..., k}.
Measures of Central Tendency
Mode or Modal Value
The most basic measure of central tendency is known as the mode or modal value. The mode identifies the value that appears most frequently in a distribution.
Part 1 of Fig. 3.9 shows that the mode, represented by grade C, serves as the "champion" of the distribution – the most frequently selected item among the five competing products. This measure holds significant importance in voting scenarios, although its implications may not always be straightforward. In cases of tied votes, multiple modal values can exist. Many software programs identify only the smallest of these traits, which can lead to misinterpretations, especially when the values are widely dispersed. For example, if the age traits 18 and 80 appear in equal quantities and more often than all other values, some software may inaccurately designate the mode as 18.
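A brief sketch of the tie problem, using hypothetical ages; pandas reports every modal value rather than just the smallest one:

```python
import pandas as pd

# Hypothetical ages in which 18 and 80 occur equally often and more often than any other value
ages = pd.Series([18, 18, 18, 80, 80, 80, 25, 30, 44])

print(ages.mode().tolist())   # [18, 80] - both modal values are reported
```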
Mean
The arithmetic mean, commonly known as the average, is calculated differently depending on the form in which the data are available. In empirical research, data typically exist in raw data tables that display individual trait values. For raw data tables, the mean is determined using the formula:
\( \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n} \)
All values of a variable are added together and the sum is divided by the number of observations n.
The mean can be visualized as a balance scale, where deviations from the mean act as weights. For instance, a deviation of 3 units from the mean corresponds to a weight of 3 grams on the left side of the scale. The greater the distance from the mean, the heavier the corresponding weight, with all negative deviations positioned on the left side.
In Fig. 3.9, the grade averages for two final exams illustrate that the arithmetic mean is balanced, with all positive deviations positioned on the right. This balance signifies that the total of the negative deviations equals the total of the positive deviations, which is a fundamental property of the arithmetic mean.
In situations where a heavy weight is balanced by multiple lighter weights, the mean may not accurately represent the distribution, potentially leading to over- or underestimation of the smaller weights. This issue, often caused by outlier values, can distort results, as discussed in Section 2.5. Consider, for example, the average age of animals in a zoo terrarium with five snakes, nine spiders, five crocodiles, and one turtle: the turtle is 120 years old, while all the others are no older than four (Fig. 3.11).
The average age of the animals is calculated to be 7.85 years. To achieve a balanced scale, the aged turtle must stand alone on one side, with all other animals positioned on the opposite side. This scenario illustrates that the mean is not an effective representation of the average age, as only one other animal exceeds the age of three.
To mitigate the impact of outliers, practitioners often use a trimmed mean, in which the smallest and largest 5 % of values are removed before the average is calculated. With 20 animals, a 5 % trim removes exactly one animal from each end – the youngest and the oldest – and yields an average age of 2 years, which reflects the age distribution far better. The price is that 10 % of the data are discarded, which can be problematic with small sample sizes.
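The trimming procedure is easy to reproduce. The following sketch uses a hypothetical age vector for the 20 terrarium animals (only the turtle's age of 120 years is taken from the text), so its output is illustrative rather than the book's exact figures.

def trimmed_mean(values, share=0.05):
    """Drop the smallest and largest `share` of observations, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * share)              # observations cut at each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)

# 19 young animals (hypothetical ages of at most 4 years) plus the 120-year-old turtle.
ages = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 120]

print(sum(ages) / len(ages))      # ordinary mean, dominated by the turtle
print(trimmed_mean(ages, 0.05))   # 5 % trimmed mean: one animal cut at each end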
Let us return to the "normal" mean, which can be calculated from a frequency table (such as an overview of grades) using the following formula: x̄ = (1/n) · Σ_{v=1}^{k} n_v · x_v.
Fig 3.10 Mean expressed as a balanced scale
In this example, we analyze the frequency table in Fig. 3.2. The index v numbers the traits of the observed ordinal variable – poor, fair, average, good, and excellent – and n_v denotes the absolute number of observations for trait v; the most frequent trait occurs n_v = 462 times. The value x_v is the numerical value assigned to trait v, with x_1 = 1 for "poor", x_2 = 2 for "fair", and so on. The mean can then be calculated from these values.
The respondents gave an average rating of 1.93, which corresponds roughly to fair. The mean could also have been calculated using the relative frequencies f_v of the traits: x̄ = 0.46·1 + 0.313·2 + 0.108·3 + 0.073·4 + 0.046·5 = 1.93 (3.6)
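The same calculation can be checked in a few lines of Python, using the trait values 1 to 5 and the relative frequencies from Eq. (3.6); this is a quick sketch, not part of the original example.

# Trait values and their relative frequencies, as used in Eq. (3.6).
values = [1, 2, 3, 4, 5]
rel_freqs = [0.46, 0.313, 0.108, 0.073, 0.046]

mean = sum(f * x for f, x in zip(rel_freqs, values))
print(round(mean, 2))  # 1.93

# Equivalent calculation from absolute frequencies n_v:
# mean = sum(n * x for n, x in zip(abs_freqs, values)) / sum(abs_freqs)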
Finally, the mean can also be calculated from traditionally classed data according to this formula: x̄ = (1/n) · Σ_{v=1}^{k} n_v · m_v = Σ_{v=1}^{k} f_v · m_v, (3.7) where m_v is the mean (midpoint) of class number v.
Students often confuse frequency tables with classed data, as both group observations by trait. The mean of classed data, however, is calculated from cardinal variables that have been summarized into classes under certain assumptions. The same approach can be applied to histograms. The calculation of mean bodyweight in Fig. 3.7, for instance, is based on the raw data table; but even when only a histogram is available, as in part 2 of Fig. 3.7, the mean can still be determined. Figure 3.12 illustrates this with a simplified histogram containing six classes.
Note: Mean = 7.85 years; 5 % trimmed mean = 2 years
Fig 3.11 Mean or trimmed mean using the zoo example
We assume that observations are evenly distributed within each class, so that the cumulative frequency rises linearly from the lower to the upper class limit. Under this assumption the class midpoint can serve as the mean of the class. To calculate the overall mean, the class midpoints are multiplied by their relative frequencies and the products are summed.
Here is another example to illustrate the calculation. Consider the following information on water use by private households (Table 3.1):
The water-use average can be calculated as follows: x̄ = Σ_{v=1}^{k} f_v · m_v = Σ_{v=1}^{4} f_v · m_v = 0.2·100 + 0.5·300 + 0.2·500 + 0.1·800 = 350 (3.8)
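As a cross-check, the classed-data mean of Eq. (3.8) can be computed directly from the class midpoints and relative frequencies appearing in the equation – a minimal sketch:

# Class midpoints m_v and relative frequencies f_v from Eq. (3.8).
midpoints = [100, 300, 500, 800]
rel_freqs = [0.2, 0.5, 0.2, 0.1]

mean_water_use = sum(f * m for f, m in zip(rel_freqs, midpoints))
print(mean_water_use)  # 350.0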
The calculation of the mean assumes equidistant intervals between trait values. For nominal variables the mean therefore cannot be computed at all, and strictly speaking ordinal variables do not permit it either. In practice, however, researchers working with large samples (say n > 99) often assume equidistance between ordinal traits and calculate the mean anyway.
The mean can also be misleading. An average test grade of C can result from every student scoring a C or from half the class scoring A and the other half F. Temperature averages obscure similarly large differences in climate: Beijing, Quito, and Milan all have an average temperature of 12 °C, yet Beijing's winters are colder than Stockholm's and its summers hotter than Rio de Janeiro's, Milan has Mediterranean temperatures that vary with the seasons, and Quito's altitude keeps temperatures nearly constant all year round (Swoboda 1971, p 36).
Fig 3.12 Calculating the mean from classed data
Table 3.1 Example of mean calculation from classed data
Source: Schwarze (2008, p 16), translated from the German
The average often fails to provide a complete picture of the data. When information about the distribution or about the relevant weights is missing, averages can be downright misleading, as the many examples collected by Krämer (2005, p 61) illustrate:
• Means rarely result in whole numbers. What, for instance, do we mean by the decimal place when we talk of 1.7 children per family or 3.5 sexual partners per person?
• When calculating the arithmetic mean, not all values necessarily carry the same weight. A Wild West eatery owner might claim that his stew contains equal parts horse and jackrabbit, but whether that is accurate depends on the actual proportions that go into the pot. In economic terms: if the average salary of women is 20 monetary units (MUs) and that of men is 30 MUs, the overall average salary is not simply 25 MUs when men make up 70 % of the workforce. The values must instead be weighted by their shares, giving a weighted arithmetic mean (a short worked sketch follows below). The Federal Statistical Office of Germany proceeds in the same way when calculating price increases, weighting products such as bananas and vehicles by their average share in household consumption.
• The choice of reference base also strongly influences how data are interpreted, as figures on traffic deaths show. Measured per passenger-kilometre, trains cause nine fatalities per 10 billion kilometres travelled and planes only three – a statistic airlines like to highlight in their advertising. Measured per unit of travel time, the picture reverses: trains account for seven fatalities per 100 million passenger-hours, planes for 24. Both reference bases are legitimate; empirical researchers must simply make clear why they chose one over the other.
I admit to a fear of flying and side with Krämer (2005, p 70) in regarding passenger-hours as the more meaningful measure of flight safety. It is curious, by the way, that many people who fear flying have no qualms about sleeping in their own beds, even though the probability of dying in bed is far higher – around 99 %. Of course, this likelihood seems less threatening when measured against the amount of time we spend in bed.
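The salary example from the list above can be written out in a few lines; the sketch below uses only the figures mentioned in the text (average salaries of 20 and 30 MUs and a 70 % share of men).

# Group means and group shares (weights) from the salary example.
group_means = {"women": 20, "men": 30}       # in monetary units (MUs)
group_shares = {"women": 0.30, "men": 0.70}  # shares of the workforce

naive_mean = sum(group_means.values()) / len(group_means)
weighted_mean = sum(group_shares[g] * group_means[g] for g in group_means)

print(naive_mean)     # 25.0 -> ignores that men and women differ in number
print(weighted_mean)  # 27.0 -> weighted arithmetic mean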
Geometric Mean
Common problems thus arise from improper weighting or from an ill-chosen reference base. But even with correct weights and a sensible reference base, the arithmetic mean can mislead – in economics above all when rates of change or growth rates are involved. Such rates are based on time series data, which record observations over time. Figure 3.13, for instance, shows a firm's sales figures and the corresponding rates of change over a five-year period.
Calculating the average rate of change with the arithmetic mean gives (10 % − 5 % − 10 % + 10 %)/4 = 1.25 %, which would imply that sales rose from €20,000 in 2002 to €21,018.91 in 2006. Actual sales in 2006, however, were only €20,691.00. For rates of change, the geometric mean is therefore used instead; it links the initial sales of 2002 with the annual growth rates up to 2006:
U_6 = U_5·(1 + 0.1) = (U_4·(1 − 0.1))·(1 + 0.1) = ... = U_2·(1 + 0.1)·(1 − 0.05)·(1 − 0.1)·(1 + 0.1). (3.9)
To determine the average rate of change in sales, we must find the rate which, applied four times in a row, gives the same result as the product of the four actual rates of change, (1 + 0.1)·(1 − 0.05)·(1 − 0.1)·(1 + 0.1):
(1 + p_geom)^4 = (1 + 0.1)·(1 − 0.05)·(1 − 0.1)·(1 + 0.1) (3.10)
For the geometric mean, the yearly rate of change is thus: p_geom = ((1 + 0.1)·(1 − 0.05)·(1 − 0.1)·(1 + 0.1))^(1/4) − 1 ≈ 0.0085, i.e. roughly 0.85 % per year.
The last column in Fig. 3.13 shows that this value correctly describes the sales growth between 2002 and 2006. Generally, the following formula applies for identifying average rates of change: p_geom = ((1 + p_1)·(1 + p_2)·...·(1 + p_n))^(1/n) − 1.
Fig. 3.13 An example of the geometric mean (columns: year; sales [mio.]; rate of change [in %]; changes in sales when using the arithmetic mean vs. the geometric mean)
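The sales example of Fig. 3.13 can be reproduced with a short Python sketch, using the initial sales of €20,000 and the rates of change implied by Eq. (3.9): +10 %, −5 %, −10 %, and +10 %.

initial_sales = 20_000
rates = [0.10, -0.05, -0.10, 0.10]           # yearly rates of change, 2002 -> 2006

# Actual sales path obtained by chaining the growth factors.
actual = initial_sales
for r in rates:
    actual *= 1 + r
print(round(actual, 2))                       # 20691.0

# Arithmetic mean of the rates overstates growth ...
p_arith = sum(rates) / len(rates)             # 0.0125
print(round(initial_sales * (1 + p_arith) ** len(rates), 2))   # 21018.91

# ... while the geometric mean reproduces the actual path.
product = 1.0
for r in rates:
    product *= 1 + r
p_geom = product ** (1 / len(rates)) - 1      # about 0.0085, i.e. roughly 0.85 % per year
print(round(initial_sales * (1 + p_geom) ** len(rates), 2))    # 20691.0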
The geometric mean for rates of change is a special instance of the geometric mean, which is defined as follows: x̄_geom = ⁿ√(x_1 · x_2 · ... · x_n) = (∏_{i=1}^{n} x_i)^(1/n).
The geometric mean is defined only for positive values. Its logarithm equals the arithmetic mean of the logarithms of the observations. Whenever the observed values are not all identical, the geometric mean is smaller than the arithmetic mean.
Harmonic Mean
The harmonic mean is rarely used in economics and is often passed over in favour of the arithmetic mean, even where the latter yields misleading results. It is the appropriate average for ratios whose numerators and denominators differ across observations, such as unemployment rates, sales productivity, or price per litre. Consider, for example, the sales productivity of three companies that each generate the same sales but employ different numbers of people: here the harmonic mean provides the more accurate summary of their performance.
To compare the companies independently of their size, we look at sales per employee. An obvious first attempt at an average is the simple (unweighted) arithmetic mean of the three productivity figures: (€100 + €200 + €1,000)/3 ≈ €433.33 per employee.
If this figure applied to every employee, the firms – which together have 16 employees – would have to generate total sales of 16 · €433.33 ≈ €6,933. In fact, total sales amount to only €3,000. The discrepancy arises because the simple mean gives each firm's productivity equal weight even though the firms employ very different numbers of people; companies with identical sales can have very different headcounts and hence very different productivities. To obtain the true average contribution per employee, each observed sales productivity must be weighted by the firm's number of employees: multiply, sum the products, and divide by the total number of employees – a weighted arithmetic mean:
x̄ = (n_1·SP_1 + n_2·SP_2 + n_3·SP_3) / n = (10 · €100 + 5 · €200 + 1 · €1,000) / 16 = €3,000 / 16 = €187.50 per employee.
6 If all values are available in logarithmic form, the following applies to the arithmetic mean: (1/n)·(ln(x_1) + ... + ln(x_n)) = (1/n)·ln(x_1 · ... · x_n) = ln((x_1 · ... · x_n)^(1/n)) = ln(x̄_geom).
The 16 employees thus generate total sales of €3,000, or €187.50 per employee on average. But what if the employee numbers – the weights – are unknown and only the k = 3 productivity figures are available? In that case the same result is obtained from the unweighted harmonic mean: x̄_harm = k / (1/SP_1 + 1/SP_2 + 1/SP_3) = 3 / (1/100 + 1/200 + 1/1,000) = 3 / 0.016 = €187.50 per employee.
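A small Python sketch confirms that, as long as every firm has the same sales, the unweighted harmonic mean of the productivities reproduces the weighted arithmetic mean; the figures are those of the example (sales of €1,000 per firm and 10, 5, and 1 employees).

sales = [1_000, 1_000, 1_000]    # numerators: sales per firm in euros
employees = [10, 5, 1]           # denominators: employees per firm

productivity = [s / e for s, e in zip(sales, employees)]    # [100.0, 200.0, 1000.0]

# Weighted arithmetic mean of the productivities, weighted by employee numbers.
weighted_arith = sum(e * p for e, p in zip(employees, productivity)) / sum(employees)

# Unweighted harmonic mean of the productivities.
k = len(productivity)
harmonic = k / sum(1 / p for p in productivity)

print(weighted_arith)  # 187.5 euros per employee
print(harmonic)        # 187.5 -> identical, because every firm has the same sales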
Here is another classic application. A student walks the 3 km to his university at varying speeds: 2 km/h for the first kilometre, 3 km/h for the second, and 4 km/h for the last. This, too, calls for the harmonic mean, because the arithmetic mean produces an inaccurate result:
x̄ = (2 km/h + 3 km/h + 4 km/h) / 3 = 3 km/h, or 1 hour for the whole route. (3.17)
Breaking the route down shows why this is wrong: the first kilometre takes 30 minutes, the second 20 minutes, and the last 15 minutes, for a total of 65 minutes. The correctly weighted average speed is therefore 3 km / (65/60 h) ≈ 2.77 km/h. The same result follows from the unweighted harmonic mean with k = 3 route segments: x̄_harm = 3 / (1/2 + 1/3 + 1/4) km/h ≈ 2.77 km/h.
Sales     Employees     Sales per employee (SP)
€1,000    10            €100
€1,000    5             €200
€1,000    1             €1,000
7 (30 min · 2 km/h + 20 min · 3 km/h + 15 min · 4 km/h) / 65 min = 2.77 km/h.
In the previous examples, the numerator values were the same for all observations: each of the three companies reported sales of €1,000, and each route segment measured 1 km. When the numerator values differ, the weighted harmonic mean must be calculated, with the numerator values serving as the weights. If, for instance, the three companies had sales of €1,000, €2,000, and €5,000, these sales figures would enter the harmonic mean as weights.
As we can see here, the unweighted harmonic mean is a special case of the weighted harmonic mean.
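The general case can be captured in a small helper function – a sketch under the assumption that the numerator values (here: sales) serve as the weights; with equal weights it collapses to the unweighted harmonic mean from the example above.

def weighted_harmonic_mean(weights, values):
    """Harmonic mean of `values`, weighted by the corresponding numerator values."""
    return sum(weights) / sum(w / x for w, x in zip(weights, values))

productivity = [100, 200, 1000]          # sales per employee for the three firms

# Equal numerators (sales of 1,000 euros each): reduces to the unweighted harmonic mean.
print(weighted_harmonic_mean([1_000, 1_000, 1_000], productivity))   # 187.5

# Differing numerators (e.g. sales of 1,000, 2,000 and 5,000 euros) would simply enter
# as weights; the productivity values themselves would of course change as well.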
Fractions do not always require the harmonic mean, however; whether the harmonic or the arithmetic mean is appropriate depends on what is held constant. Suppose the student instead walks to campus for one hour at 2 km/h, one hour at 3 km/h, and one hour at 4 km/h, so that the travel times rather than the partial distances are equal. The average speed is then simply the arithmetic mean of the speeds, (2 + 3 + 4)/3 = 3 km/h: in three hours the student covers 2 + 3 + 4 = 9 km. Because each speed applies for the same duration, the relative weights are expressed in hours – the units of the denominator – and the arithmetic mean gives the correct result.
Generally, the harmonic mean is appropriate for ratios when the relative weights are expressed in the units of the numerator (here: kilometres); if the relative weights are expressed in the units of the denominator (here: hours), the arithmetic mean should be used instead. Like the geometric mean, the harmonic mean is defined only for positive values greater than zero. For observations that are not all equal, the following relationship holds: x̄_harm < x̄_geom < x̄.
The Median
When the mean does not accurately reflect a distribution, other measures of central tendency are needed. Suppose an advertising agency planning a campaign wants to know the average age of diaper users and must derive it from the available data.
Calculating the mean from the class midpoints of the diaper users' ages yields an average of 21 years, which would suggest that the typical user is of college age. This conclusion is obviously questionable, and not only because universities rarely offer baby-changing facilities. The culprit is the extreme values in the outer margin of the age distribution: the high age classes create a bimodal distribution, and the resulting mean falls precisely in the age range in which diaper use is actually lowest.
A better way to characterize the average diaper user is the median. The median divides the ordered dataset into two equal halves: 50 % of the values lie at or below it and 50 % at or above it. In distributions like this one it yields far more meaningful results than the mean.
Figure 3.14 shows five weights ordered from lightest to heaviest. The median is x̃ = x_(0.5) = x_(3) = 9: half of the weights lie on either side of the third weight. Various formulas exist for calculating the median; for raw, unclassed data tables, statistics textbooks typically recommend the following:
x̃ = x_((n+1)/2) for an odd number of observations (n) (3.22)
Table 3.3 Share of sales by age class for diaper users
Fig 3.14 The median: The central value of unclassed data
8 To find the value for the last class midpoint, take half the class width – (101 − 61)/2 = 20 – and from that we get 61 + 20 = 81 years for the midpoint.
The median is the middle value of a dataset, and how it is calculated depends on the number of observations: with an odd number of observations it is the single middle value, while with an even number it is the average of the two central values. In both cases, 50 % of the observations lie at or below the median and 50 % at or above it.
x̃ = ½ · (x_(n/2) + x_(n/2+1)) for an even number of observations. (3.23)
If one plugs the weights from the example into the first formula, the result is x̃ = x_((5+1)/2) = x_(3) = 9.
In the ordered dataset, the weight at the third position is thus the median. When the median must be calculated from classed data, as in the diaper example, the following formula is used instead:
x̃ = x_0.5 = x_{i−1}^UP + ((0.5 − F(x_{i−1}^UP)) / f(x_i)) · (x_i^UP − x_i^LOW)
First we identify the last class whose cumulative relative frequency still falls short of the 50 % mark – in the diaper example, the class of the 0-to-1-year-olds. The median must therefore lie above the upper limit of this class; what remains to be determined is by how much. The gap between the assumed value of 0.5 and the cumulative frequency at this upper class limit amounts to 5 percentage points.
These 5 percentage points must be covered by the next largest (ith) class, which therefore contains the median. They are then set in relation to the relative frequency of that entire class: 0.05 / 0.25 = 0.20.
The median thus lies 20 % of the way into the class that contains it. Since that class covers the ages 2, 3, and 4 – a width of Δ_i = 3 years – we add 0.2 · 3 = 0.6 years to the class boundary and obtain a median age of approximately 2.6 years, which describes the "average user of diapers" far better than the arithmetic mean does. To be sure, in a bimodal distribution the median can be just as debatable a summary as the mean; its real strength is its robustness, which makes it particularly useful when the data contain many outliers.
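The interpolation formula can be turned into a short function. The sketch below uses the values implied by the calculation in the text – a class boundary of 2 years, a cumulative share of 45 % below it, a class frequency of 25 %, and a class width of 3 years – which are reconstructions, not figures printed in Table 3.3.

def classed_median(lower_boundary, cum_freq_below, class_freq, class_width, p=0.5):
    """Median (or other quantile) of classed data via linear interpolation within the class."""
    return lower_boundary + (p - cum_freq_below) / class_freq * class_width

# Values implied by the diaper example: 45 % of observations lie below the boundary of
# 2 years; the class containing the median spans 3 years and holds 25 % of observations.
print(classed_median(lower_boundary=2, cum_freq_below=0.45, class_freq=0.25, class_width=3))
# 2.6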
Quartile and Percentile
In addition to the median, other quantiles serve as measures of central tendency for ordered datasets. When the ordered dataset is divided into 100 equal parts, the resulting quantiles are called percentiles; calculating them requires at least an ordinal scale, and they are defined analogously to the median. The p % percentile is the value at or below which at least p % of the observations lie, while at least (100 − p) % lie at or above it. If, for example, the 17 % percentile of age in a grocery-store survey is 23 years, then 17 % of the respondents are 23 or younger and 83 % are 23 or older. The interpretation mirrors that of the median, which is simply a special quantile – the 50 % percentile – dividing the ordered dataset into two equal halves.
In practice, the most important group of quantiles are the quartiles, which divide an ordered dataset into four equal parts. The first quartile, also known as the lower quartile, is the 25 % percentile; the second quartile is the median, or 50 % percentile; and the third quartile, the upper quartile, is the 75 % percentile.
A common way to calculate quantiles from raw data – and the one implemented in many statistical software packages – is the weighted average method. To find the lower quartile (p = 25 %) in an ordered sample of size n = 850, we first compute (n + 1)·p = (850 + 1)·0.25 = 212.75. The integer part (i = 212) and the decimal fraction (f = 0.75) indicate that the desired quantile lies between the 212th and 213th observations of the ordered dataset; the decimal fraction pinpoints its exact position between these two ranks.
Fig 3.15 The median: The middle value of classed data
Here the value 212.75 lies closer to 213 than to 212. The digits after the decimal mark locate the quantile's position between the two neighbouring values via the following formula: x_p = (1 − f)·x_(i) + f·x_(i+1).
In our butter example, the variable bodyweight produces these results:
Another example of the calculation of a quartile is shown in Fig. 3.16.
The weighted average method fails for extreme quantiles. For the 99 % quantile of the five weights, (n + 1)·p = 6·0.99 = 5.94, which would require a sixth, fictitious weight; for the 1 % quantile, (n + 1)·p = 0.06, which would require a non-existent weight at position 0. In such cases, software programs instead report the actual largest and smallest observed values as the quantiles, here x_0.99 = 15 and x_0.01 = 3.
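The weighted average method itself is easy to implement. In the sketch below, the five weights are hypothetical except for the minimum of 3, the median of 9, and the maximum of 15 mentioned in the text.

def quantile_weighted_average(values, p):
    """p-quantile of raw data via the weighted average method, falling back to the
    smallest/largest observation when (n + 1) * p points outside the data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p
    i = int(position)               # integer part
    f = position - i                # decimal fraction
    if i < 1:                       # e.g. p = 0.01 with n = 5 -> position 0.06
        return ordered[0]
    if i >= n:                      # e.g. p = 0.99 with n = 5 -> position 5.94
        return ordered[-1]
    return (1 - f) * ordered[i - 1] + f * ordered[i]

weights = [3, 6, 9, 12, 15]         # middle values are hypothetical

print(quantile_weighted_average(weights, 0.50))   # 9   (the median)
print(quantile_weighted_average(weights, 0.25))   # 4.5 (lower quartile)
print(quantile_weighted_average(weights, 0.99))   # 15  (largest observed value)
print(quantile_weighted_average(weights, 0.01))   # 3   (smallest observed value)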