HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF APPLIED SCIENCE
Assignment Report
LINEAR REGRESSION MODEL FOR PREDICTING 3D PRINT QUALITY AND
STRENGTH
[Probability and Statistics – MT2013]
Supervisor: Dr. Nguyen Tien Dung
1 DATA INTRODUCTION
1.1. Dataset description
The precision demanded in mechanical engineering, particularly in mechatronics and 3D printing, is paramount, especially concerning the strength of the materials employed. A careful analysis of each material's specifications is therefore needed to identify the optimal choice for a specific application.
One pivotal factor influencing the quality and functionality of printed components is surface roughness. This term denotes the small irregularities or deviations present on an object's surface, which affect not only its visual appeal but also its functional attributes. Surface roughness values vary significantly with factors such as the 3D printing process, the material used, and other process parameters. Understanding and precisely controlling surface roughness is essential for the consistent, reliable production of high-quality parts.
To delve into this subject further, an overview of surface roughness will be provided. Statistical methods let us quantify the correlation between product roughness and the contributing factors, supporting a more informed approach to 3D printing.
- The dataset consists of 11 features or variables. Each row represents an observation or data point, and the dataset contains a total of 50 observations.
- The `material` column has been encoded, representing "abs" as 0 and "pla" as 1.
2 THEORETICAL BACKGROUND
2.1. Linear regression
Linear regression is a statistical modeling technique used to establish a linear relationship between a dependent variable and one or more independent variables. It assumes that this relationship can be approximated by a straight line.
The basic form of linear regression is represented by the equation:
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
where:
- y is the dependent variable, i.e. the variable being predicted.
- x₁, x₂, …, xₚ are the independent variables or predictors.
- β₀, β₁, β₂, …, βₚ are the coefficients or parameters that quantify the relationship between the variables.
- ε is the error term or residual, representing the unexplained variation in the dependent variable.
The goal of linear regression is to estimate the values of the coefficients (β₀, β₁, β₂, …, βₚ) that minimize the sum of squared residuals. This is typically done using the method of ordinary least squares (OLS), which finds the coefficients that minimize the sum of the squared differences between the observed values of the dependent variable and the predicted values.
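As a concrete illustration, the OLS coefficients can be computed directly from the normal equations and checked against R's built-in lm(). The data below are made up purely for this sketch; only base R is used.

```r
# Illustrative data (not from the printing dataset)
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

# Normal equations: beta = (X'X)^(-1) X'y, with a column of 1s for the intercept
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() performs the same least-squares fit
fit <- lm(y ~ x)
rbind(manual = as.numeric(beta_hat), lm = as.numeric(coef(fit)))
```

The two rows agree, since lm() minimizes the same sum of squared residuals (numerically via a QR decomposition rather than an explicit matrix inverse).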
Linear regression rests on several key assumptions:
1. Linearity: the relationship between the independent variables and the dependent variable is linear, meaning that the change in the dependent variable is proportional to the change in the independent variables.
2. Independence: the observations or data points are independent of each other.
3. Homoscedasticity: the variance of the error term is constant across observations.
4. Normality: the error term follows a normal distribution, meaning that the residuals should be normally distributed around zero.
5. No multicollinearity: the independent variables are not highly correlated with each other, as high correlation can lead to collinearity issues and unstable coefficient estimates.
Linear regression can be extended to handle more complex relationships by incorporating polynomial terms, interaction terms, or techniques such as feature engineering and regularization.
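For instance, polynomial and interaction terms can be added directly in R's formula interface. The snippet below is a sketch on synthetic data; the variable names x1 and x2 are illustrative, not columns of the report's dataset.

```r
set.seed(2)
x1 <- runif(30)
x2 <- runif(30)
y  <- 1 + x1 + 0.5 * x1^2 + x1 * x2 + rnorm(30, sd = 0.1)

# I(x1^2) adds a quadratic (polynomial) term; x1:x2 adds an interaction term
fit <- lm(y ~ x1 + I(x1^2) + x1:x2)
summary(fit)$r.squared   # high here, since the synthetic noise is small
```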
The output of a linear regression model provides insight into the strength and significance of the relationship between the independent variables and the dependent variable. It includes the estimated coefficients, their standard errors, p-values, and measures of model fit such as the R-squared and adjusted R-squared.
The R-squared value, also known as the coefficient of determination, is a statistical measure that indicates the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. It provides an assessment of how well the regression model fits the observed data.
The R-squared value ranges from 0 to 1, where:
- 0 indicates that none of the variation in the dependent variable is explained by the independent variables, meaning the model does not fit the data well.
- 1 indicates that all of the variation in the dependent variable is explained by the independent variables, meaning the model fits the data perfectly.
The R-squared value is calculated as follows:
R-squared = 1 - (SSR/SST)
where:
- SSR (Sum of Squared Residuals) represents the sum of the squared differences between the observed values of the dependent variable and the predicted values.
- SST (Total Sum of Squares) represents the sum of the squared differences between the observed values of the dependent variable and the mean of the dependent variable.
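The formula can be verified by hand in R. The data here are synthetic, and summary()$r.squared is base R's own computation of the same quantity.

```r
set.seed(3)
x <- 1:25
y <- 5 + 1.5 * x + rnorm(25)

fit <- lm(y ~ x)
ssr <- sum(residuals(fit)^2)   # sum of squared residuals (SSR)
sst <- sum((y - mean(y))^2)    # total sum of squares (SST)
r2  <- 1 - ssr / sst           # same value reported by summary(fit)$r.squared
r2
```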
A higher R-squared value indicates a better fit of the regression model to the data, as it suggests that a larger proportion of the variance in the dependent variable is accounted for by the independent variables. However, the R-squared value must be interpreted in the context of the specific problem and the nature of the data: a high R-squared does not necessarily imply that the model is accurate or that the independent variables have a causal relationship with the dependent variable.
It is also worth noting that the R-squared value can be misleading when used inappropriately. It should be considered alongside other metrics and the overall context of the analysis. For example, it is important to assess the statistical significance of the coefficients, examine the residuals, and consider the specific goals and limitations of the regression model.
Linear regression is widely used in various fields, including economics, finance, the social sciences, engineering, and machine learning, to analyze and predict relationships between variables.
2.2. Definitions
Arithmetic Mean:
The arithmetic mean, often referred to as the average, is a measure of central tendency that provides the typical or representative value of a set of numbers. It is computed by summing all the values in a dataset and dividing the sum by the total number of observations. Mathematically, the arithmetic mean is represented as:
Mean = (x₁ + x₂ + … + xₙ) / n
where x₁, x₂, …, xₙ are the individual values in the dataset, and n is the total number of observations. The arithmetic mean is sensitive to extreme values, since every observation contributes equally to it.
Median:
The median is another measure of central tendency that represents the middle value of a sorted dataset. It divides the dataset into two equal halves, with 50% of the observations falling below it and 50% above it. To calculate the median, the dataset is first arranged in ascending or descending order and the middle value is selected.
For an odd number of observations, the middle value is the median; for an even number, the median is the average of the two middle values. The median is less affected by extreme values and is commonly used when the dataset contains outliers.
Standard Deviation:
The standard deviation is a measure of the dispersion or variability of a dataset. It quantifies the typical amount by which each data point deviates from the mean, and is calculated by taking the square root of the variance. Mathematically, it can be represented as:
Standard Deviation = √( ( (x₁ - Mean)² + (x₂ - Mean)² + … + (xₙ - Mean)² ) / n )
where x₁, x₂, …, xₙ are the individual values in the dataset, Mean is the arithmetic mean of the dataset, and n is the total number of observations. A higher standard deviation indicates greater variability in the dataset, while a lower standard deviation indicates more clustered or homogeneous data.
Minimum and Maximum:
The minimum and maximum are the smallest and largest values, respectively, in a dataset. They provide insight into the range of values present: the minimum represents the lowest value, while the maximum represents the highest value. These values are helpful in understanding the boundaries or extremes within the dataset.
These statistical measures, including the arithmetic mean, median, standard deviation, minimum, and maximum, provide important descriptive information about a dataset, summarizing its central tendency, variability, and range. They are widely used in data analysis and research to gain insights and make informed decisions.
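All five measures are built into base R; the small vector below is illustrative. One detail worth noting: R's sd() divides by n - 1 (the sample standard deviation), whereas the formula above divides by n (the population standard deviation).

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(v)     # arithmetic mean: 40 / 8 = 5
median(v)   # even number of values, so the two middle ones are averaged: 4.5
min(v)      # smallest value: 2
max(v)      # largest value: 9

# sd() uses the n - 1 (sample) denominator; the formula in the text
# divides by n, which is the population standard deviation:
sd(v)                          # sample SD, about 2.14
sqrt(mean((v - mean(v))^2))    # population SD as defined above: 2
```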
Click on “Browse” and choose the dataset we want to import; a code preview appears below. Then click “Import”.
3.2. Data cleaning
Create a new file containing only the key variables given in the topic, save it as “data2”, and check the first 5 rows of the new file.
Let us break down the code:
● Name the new dataset “data2”. This R code selects specific columns from the dataset named “data1” and stores them in a new variable called “data2”.
● The square brackets “[ ]” are used for indexing or selecting elements from a dataset.
● Within the brackets, the comma “,” separates the row and column selections; leaving the slot before the comma empty means all rows are selected.
● After the comma, “c()” creates a vector containing the names of the columns we want to select: "layer_height", "wall_thickness", "infill_density", "infill_pattern", "nozzle_temperature", "bed_temperature", "print_speed", "material", "fan_speed" and "roughness".
Checking for missing values
Explanation:
apply(is.na(data2),2,which)
is.na(data2): This part of the code generates a logical matrix of the same size as “data2”, where each element is “TRUE” if the corresponding element in “data2” is “NA” (missing) and “FALSE” otherwise.
apply(is.na(data2), 2, which): Here, the “apply()” function applies the “which()” function to each column of the logical matrix generated by is.na(data2); the second argument, 2, tells “apply()” to operate over columns. The “which()” function, applied to a logical vector or matrix, returns the indices (positions) where the value is “TRUE”.
From the above result, we know that “data2” has no missing values.
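To see what the check reports when missing values are present, here is the same call on a toy frame that does contain an NA (for the report's data2, every column returned an empty result, i.e. no positions were flagged):

```r
d <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

is.na(d)                  # TRUE only for the second element of column "a"
res <- apply(is.na(d), 2, which)
res                       # column "a": position 2; column "b": no positions
```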
4 DESCRIPTIVE STATISTICS
4.1. Descriptive statistics for quantitative variables
Given the quantitative variables (layer_height, wall_thickness, infill_density, nozzle_temperature, bed_temperature, print_speed, fan_speed, roughness), we can calculate the mean, median, standard deviation, minimum, maximum, and first and third quartile values, and save those to variables named Mean, Median, Standard_deviation, Min, Max, Q1, and Q3.
● The variables combined are “Mean”, “Median”, “Standard_deviation”, “Min”, “Max”, “Q1” and “Q3”.
● data.frame( ): This function constructs a data frame from the variables listed above.
● t(): This function transposes the data frame, essentially switching its rows and columns. The resulting data frame therefore has its rows and columns swapped. If we don’t transpose the data frame, it looks like this:
So, in order to get a better view of the data, transposing is necessary.
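The construct-then-transpose step can be sketched compactly with sapply() on a toy data frame; the report applies the same idea to the eight quantitative columns of data2.

```r
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

Mean   <- sapply(toy, mean)
Median <- sapply(toy, median)
Min    <- sapply(toy, min)
Max    <- sapply(toy, max)

# Before t(), each statistic is a column; after t(), each statistic is a row
tab <- t(data.frame(Mean, Median, Min, Max))
tab
```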
4.2. Descriptive statistics for categorical variables
Regarding the qualitative variables, we count the frequency of each level as follows:
table(data2$infill_pattern)
table(data2$material)
Result:
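table() simply counts the occurrences of each level; a toy example using the same two levels as the material column:

```r
m <- c("abs", "pla", "abs", "abs", "pla")
table(m)    # abs appears 3 times, pla 2 times
```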
4.3. Histogram for roughness
A histogram is a simple and easy-to-understand graph that represents the frequency distribution of numerical data within a specific range. The range of values is divided into intervals called bins or cells.
Draw the histogram of the variable roughness:
R code:
# Histogram graph for the variable "roughness"
hist(data2$roughness, main = "Histogram of roughness",
     xlab = "Roughness", col = "blue", labels = TRUE,
     ylim = c(0, 15))
Result:
Explanation: The graph illustrates that the majority of samples have a roughness falling within the 0 to 300 µm range. Among the 50 samples examined, those with roughness between 50 and 100 µm and between 150 and 200 µm are notably more numerous than the rest. Conversely, samples in the 300 to 400 µm range are the least represented in the dataset.
4.4. Boxplot of roughness relative to each categorical variable
A box plot graphically summarizes how the data in a dataset are distributed using the five-number summary: minimum, first quartile, median, third quartile, and maximum. The box spans the interquartile range between the lower and upper quartiles, and two lines extending from its ends (the whiskers) indicate the minimum and maximum within the variability range. Data points plotted as circles beyond the whiskers are outliers.
Boxplot relative to infill pattern:
R code:
# Boxplot graph for the variable "infill_pattern"
boxplot(data2$roughness ~ data2$infill_pattern, main =
        "Roughness of Infill pattern", col = c(4, 5), xlab =
        "Infill pattern", ylab = "Roughness")
Result:
With a grid pattern:
- The highest roughness is 368 µm
- The lowest roughness is 24 µm
- 25% of samples have the roughness of 92 µm or less
- 50% of samples have the roughness of 172 µm or less
- 75% of samples have the roughness of 244 µm or less
With a honeycomb pattern:
- The highest roughness is 360 µm
- 25% of samples have the roughness of 88 µm or less
- 50% of samples have the roughness of 154 µm or less
Boxplot relative to material:
R code:
# Boxplot graph for the variable "material"
boxplot(data2$roughness ~ data2$material, main =
        "Roughness of Material", col = c(2, 3), xlab =
        "Material", ylab = "Roughness")
Result:
Explanation: 75% of samples have the roughness of 220 µm or less.
Samples whose material is ABS:
- The highest roughness is 368 µm
- The lowest roughness is 25 µm
- 25% of samples have the roughness of 92 µm or less
- 75% of samples have the roughness of 289 µm or less
Samples whose material is PLA:
- The highest roughness is 321 µm
- The lowest roughness is 21 µm
- 25% of samples have the roughness of 88 µm or less
- 50% of samples have the roughness of 145 µm or less
- 75% of samples have the roughness of 192 µm or less
5 INFERENTIAL STATISTICS
Use an appropriate linear regression model to evaluate the factors affecting the roughness of the product after printing.
First, we build the linear regression model with:
- The dependent variable: roughness
- The independent variables: layer_height, wall_thickness, infill_density, infill_pattern, nozzle_temperature, bed_temperature, print_speed, material, and fan_speed
The model is displayed as follows:
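A sketch of how such a model is fitted with lm(). Since the report's actual output is not reproduced here, the snippet fits the same formula shape on synthetic stand-in data: the column names mirror the report's, but the values are made up.

```r
set.seed(4)
toy <- data.frame(layer_height = runif(50, 0.02, 0.2),
                  print_speed  = runif(50, 40, 120))
toy$roughness <- 100 + 800 * toy$layer_height +
                 0.5 * toy$print_speed + rnorm(50, sd = 5)

# The report's full model uses all nine predictors from data2, e.g.:
# model <- lm(roughness ~ layer_height + wall_thickness + infill_density +
#             infill_pattern + nozzle_temperature + bed_temperature +
#             print_speed + material + fan_speed, data = data2)

# Reduced version on the toy columns only:
model <- lm(roughness ~ layer_height + print_speed, data = toy)
summary(model)   # estimated coefficients, standard errors, p-values, R-squared
```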