HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF APPLIED SCIENCE
Assignment Report
LINEAR REGRESSION MODEL FOR PREDICTING 3D PRINT QUALITY AND
STRENGTH
[Probability and Statistics – MT2013]
Supervisor: Dr. Nguyen Tien Dung
1 DATA INTRODUCTION
1.1. Dataset description
The precision demanded in mechanical engineering, particularly in mechatronics and 3D printing, is paramount, especially concerning the strength of the materials employed. A careful analysis of each material's specifications is therefore needed to identify the optimal choice for a specific application.
One pivotal factor influencing the quality and functionality of printed components is surface roughness. This term denotes the small irregularities or deviations present on an object's surface, which affect not only its visual appeal but also its functional attributes. Surface roughness values vary significantly with factors such as the 3D printing process, the material used, and other process parameters. Understanding and precisely controlling surface roughness is essential for the consistent, reliable production of high-quality parts.
To delve into this subject further, an overview of surface roughness will be provided. Statistical methods let us quantify the correlation between product roughness and the contributing factors, supporting a more informed approach to 3D printing.
- The dataset consists of 11 features or variables. Each row represents an observation or data point, and the dataset contains a total of 50 observations.
- The `material` column has been encoded, representing "abs" as 0 and "pla" as 1.
2 THEORETICAL BACKGROUND
2.1. Linear regression
Linear regression is a statistical modeling technique used to establish a linear relationship between a dependent variable and one or more independent variables. It assumes that this relationship can be approximated by a straight line.
The basic form of linear regression is represented by the equation:
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
where:
- y is the dependent variable, i.e. the variable being predicted.
- x₁, x₂, …, xₚ are the independent variables or predictors.
- β₀, β₁, β₂, …, βₚ are the coefficients or parameters that quantify the relationship between the variables.
- ε is the error term or residual, representing the unexplained variation in the dependent variable.
The goal of linear regression is to estimate the values of the coefficients (β₀, β₁, β₂, …, βₚ) that minimize the sum of squared residuals. This is typically done using the method of ordinary least squares (OLS), which finds the coefficients that minimize the sum of the squared differences between the observed values of the dependent variable and the predicted values.
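As a concrete illustration, the OLS coefficients can be computed directly from the normal equations and checked against R's built-in lm(). The data below are made up purely for this sketch; only base R is used.

```r
# Illustrative data (not from the printing dataset)
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

# Normal equations: beta = (X'X)^(-1) X'y, with a column of 1s for the intercept
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() performs the same least-squares fit
fit <- lm(y ~ x)
rbind(manual = as.numeric(beta_hat), lm = as.numeric(coef(fit)))
```

The two rows agree, since lm() minimizes the same sum of squared residuals (numerically via a QR decomposition rather than an explicit matrix inverse).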
Linear regression rests on several key assumptions:
1. Linearity: the relationship between the independent variables and the dependent variable is linear, meaning that the change in the dependent variable is proportional to the change in the independent variables.
2. Independence: the observations or data points are independent of each other.
3. Homoscedasticity: the variance of the error term is constant across observations.
4. Normality: the error term follows a normal distribution, meaning that the residuals should be normally distributed around zero.
5. No multicollinearity: the independent variables are not highly correlated with each other, as high correlation can lead to collinearity issues and unstable coefficient estimates.
Linear regression can be extended to handle more complex relationships by incorporating polynomial terms, interaction terms, or techniques such as feature engineering and regularization.
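For instance, polynomial and interaction terms can be added directly in R's formula interface. The snippet below is a sketch on synthetic data; the variable names x1 and x2 are illustrative, not columns of the report's dataset.

```r
set.seed(2)
x1 <- runif(30)
x2 <- runif(30)
y  <- 1 + x1 + 0.5 * x1^2 + x1 * x2 + rnorm(30, sd = 0.1)

# I(x1^2) adds a quadratic (polynomial) term; x1:x2 adds an interaction term
fit <- lm(y ~ x1 + I(x1^2) + x1:x2)
summary(fit)$r.squared   # high here, since the synthetic noise is small
```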
The output of a linear regression model provides insight into the strength and significance of the relationship between the independent variables and the dependent variable. It includes the estimated coefficients, their standard errors, p-values, and measures of model fit such as the R-squared and adjusted R-squared.
The R-squared value, also known as the coefficient of determination, is a statistical measure that indicates the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. It provides an assessment of how well the regression model fits the observed data.
The R-squared value ranges from 0 to 1, where:
- 0 indicates that none of the variation in the dependent variable is explained by the independent variables, meaning the model does not fit the data well.
- 1 indicates that all of the variation in the dependent variable is explained by the independent variables, meaning the model fits the data perfectly.
The R-squared value is calculated as follows:
R-squared = 1 - (SSR/SST)
where:
- SSR (Sum of Squared Residuals) represents the sum of the squared differences between the observed values of the dependent variable and the predicted values.
- SST (Total Sum of Squares) represents the sum of the squared differences between the observed values of the dependent variable and the mean of the dependent variable.
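The formula can be verified by hand in R. The data here are synthetic, and summary()$r.squared is base R's own computation of the same quantity.

```r
set.seed(3)
x <- 1:25
y <- 5 + 1.5 * x + rnorm(25)

fit <- lm(y ~ x)
ssr <- sum(residuals(fit)^2)   # sum of squared residuals (SSR)
sst <- sum((y - mean(y))^2)    # total sum of squares (SST)
r2  <- 1 - ssr / sst           # same value reported by summary(fit)$r.squared
r2
```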
A higher R-squared value indicates a better fit of the regression model to the data, as it suggests that a larger proportion of the variance in the dependent variable is accounted for by the independent variables. However, the R-squared value must be interpreted in the context of the specific problem and the nature of the data: a high R-squared does not necessarily imply that the model is accurate or that the independent variables have a causal relationship with the dependent variable.
It is also worth noting that the R-squared value can be misleading when used inappropriately. It should be considered alongside other metrics and the overall context of the analysis. For example, it is important to assess the statistical significance of the coefficients, examine the residuals, and consider the specific goals and limitations of the regression model.
Linear regression is widely used in various fields, including economics, finance, the social sciences, engineering, and machine learning, to analyze and predict relationships between variables.
2.2. Definitions
Arithmetic Mean:
The arithmetic mean, often referred to as the average, is a measure of central tendency that provides the typical or representative value of a set of numbers. It is computed by summing all the values in a dataset and dividing the sum by the total number of observations. Mathematically, the arithmetic mean is represented as:
Mean = (x₁ + x₂ + … + xₙ) / n
where x₁, x₂, …, xₙ are the individual values in the dataset, and n is the total number of observations. The arithmetic mean is sensitive to extreme values, since every observation contributes equally to it.
Median:
The median is another measure of central tendency that represents the middle value of a sorted dataset. It divides the dataset into two equal halves, with 50% of the observations falling below it and 50% above it. To calculate the median, the dataset is first arranged in ascending or descending order and the middle value is selected.
For an odd number of observations, the middle value is the median; for an even number, the median is the average of the two middle values. The median is less affected by extreme values and is commonly used when the dataset contains outliers.
Standard Deviation:
The standard deviation is a measure of the dispersion or variability of a dataset. It quantifies the typical amount by which each data point deviates from the mean, and is calculated by taking the square root of the variance. Mathematically, it can be represented as:
Standard Deviation = √( ( (x₁ - Mean)² + (x₂ - Mean)² + … + (xₙ - Mean)² ) / n )
where x₁, x₂, …, xₙ are the individual values in the dataset, Mean is the arithmetic mean of the dataset, and n is the total number of observations. A higher standard deviation indicates greater variability in the dataset, while a lower standard deviation indicates more clustered or homogeneous data.
Minimum and Maximum:
The minimum and maximum are the smallest and largest values, respectively, in a dataset. They provide insight into the range of values present: the minimum represents the lowest value, while the maximum represents the highest value. These values are helpful in understanding the boundaries or extremes within the dataset.
These statistical measures, including the arithmetic mean, median, standard deviation, minimum, and maximum, provide important descriptive information about a dataset, summarizing its central tendency, variability, and range. They are widely used in data analysis and research to gain insights and make informed decisions.
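All five measures are built into base R; the small vector below is illustrative. One detail worth noting: R's sd() divides by n - 1 (the sample standard deviation), whereas the formula above divides by n (the population standard deviation).

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(v)     # arithmetic mean: 40 / 8 = 5
median(v)   # even number of values, so the two middle ones are averaged: 4.5
min(v)      # smallest value: 2
max(v)      # largest value: 9

# sd() uses the n - 1 (sample) denominator; the formula in the text
# divides by n, which is the population standard deviation:
sd(v)                          # sample SD, about 2.14
sqrt(mean((v - mean(v))^2))    # population SD as defined above: 2
```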
Click on “Browse” and choose the dataset we want to import; a code preview appears below. Then click “Import”.
3.2. Data cleaning
Create a new file containing only the key variables given in the topic, save it as “data2”, and check the first 5 rows of the new file.
Let us break down the code:
● Name the new dataset “data2”. This R code selects specific columns from the dataset named “data1” and stores them in a new variable called “data2”.
● The square brackets “[ ]” are used for indexing or selecting elements from a dataset.
● Within the brackets, the comma “,” separates the row and column selections; leaving the slot before the comma empty means all rows are selected.
● After the comma, “c()” creates a vector containing the names of the columns we want to select: "layer_height", "wall_thickness", "infill_density", "infill_pattern", "nozzle_temperature", "bed_temperature", "print_speed", "material", "fan_speed" and "roughness".
Checking for missing values
Explanation:
apply(is.na(data2),2,which)
is.na(data2): This part of the code generates a logical matrix of the same size as “data2”, where each element is “TRUE” if the corresponding element in “data2” is “NA” (missing) and “FALSE” otherwise.
apply(is.na(data2), 2, which): Here, the “apply()” function applies the “which()” function to each column of the logical matrix generated by is.na(data2); the second argument, 2, tells “apply()” to operate over columns. The “which()” function, applied to a logical vector or matrix, returns the indices (positions) where the value is “TRUE”.
From the above result, we know that “data2” has no missing values.
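To see what the check reports when missing values are present, here is the same call on a toy frame that does contain an NA (for the report's data2, every column returned an empty result, i.e. no positions were flagged):

```r
d <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6))

is.na(d)                  # TRUE only for the second element of column "a"
res <- apply(is.na(d), 2, which)
res                       # column "a": position 2; column "b": no positions
```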
4 DESCRIPTIVE STATISTICS
4.1. Descriptive statistics for quantitative variables
Given the quantitative variables (layer_height, wall_thickness, infill_density, nozzle_temperature, bed_temperature, print_speed, fan_speed, roughness), we can calculate the mean, median, standard deviation, minimum, maximum, and first and third quartile values, and save those to variables named Mean, Median, Standard_deviation, Min, Max, Q1, and Q3.
● The variables combined are “Mean”, “Median”, “Standard_deviation”, “Min”, “Max”, “Q1” and “Q3”.
● data.frame( ): This function constructs a data frame from the variables listed above.
● t(): This function transposes the data frame, essentially switching its rows and columns. The resulting data frame therefore has its rows and columns swapped. If we don’t transpose the data frame, it looks like this:
So, in order to get a better view of the data, transposing is necessary.
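The construct-then-transpose step can be sketched compactly with sapply() on a toy data frame; the report applies the same idea to the eight quantitative columns of data2.

```r
toy <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

Mean   <- sapply(toy, mean)
Median <- sapply(toy, median)
Min    <- sapply(toy, min)
Max    <- sapply(toy, max)

# Before t(), each statistic is a column; after t(), each statistic is a row
tab <- t(data.frame(Mean, Median, Min, Max))
tab
```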
4.2. Descriptive statistics for categorical variables
Regarding the qualitative variables, we count the frequency of each level as follows:
table(data2$infill_pattern)
table(data2$material)
Result:
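table() simply counts the occurrences of each level; a toy example using the same two levels as the material column:

```r
m <- c("abs", "pla", "abs", "abs", "pla")
table(m)    # abs appears 3 times, pla 2 times
```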
4.3. Histogram for roughness
A histogram is a simple and easy-to-understand graph that represents the frequency distribution of numerical data within a specific range. The range of values is divided into intervals called bins or cells.
Draw the histogram of the variable roughness:
R code:
# Histogram graph for the variable "roughness"
hist(data2$roughness, main = "Histogram of roughness",
     xlab = "Roughness", col = "blue", labels = TRUE,
     ylim = c(0, 15))
Result:
Explanation: The graph illustrates that the majority of samples have a roughness falling within the 0 to 300 µm range. Among the 50 samples examined, those with roughness between 50 and 100 µm and between 150 and 200 µm are notably more numerous than the rest. Conversely, samples in the 300 to 400 µm range are the least represented in the dataset.
4.4. Boxplot of roughness relative to each categorical variable
A box plot graphically summarizes how the data in a dataset are distributed using the five-number summary: minimum, first quartile, median, third quartile, and maximum. The box spans the interquartile range between the lower and upper quartiles, and two lines extending from its ends (the whiskers) indicate the minimum and maximum within the variability range. Data points plotted as circles beyond the whiskers are outliers.
Boxplot relative to infill pattern:
R code:
# Boxplot graph for the variable "infill_pattern"
boxplot(data2$roughness ~ data2$infill_pattern, main =
        "Roughness of Infill pattern", col = c(4, 5), xlab =
        "Infill pattern", ylab = "Roughness")
Result:
With a grid pattern:
- The highest roughness is 368 µm
- The lowest roughness is 24 µm
- 25% of samples have the roughness of 92 µm or less
- 50% of samples have the roughness of 172 µm or less
- 75% of samples have the roughness of 244 µm or less
With a honeycomb pattern:
- The highest roughness is 360 µm
- 25% of samples have the roughness of 88 µm or less
- 50% of samples have the roughness of 154 µm or less
Boxplot relative to material:
R code:
# Boxplot graph for the variable "material"
boxplot(data2$roughness ~ data2$material, main =
        "Roughness of Material", col = c(2, 3), xlab =
        "Material", ylab = "Roughness")
Result:
Explanation: 75% of samples have the roughness of 220 µm or less.
Samples whose material is ABS:
- The highest roughness is 368 µm
- The lowest roughness is 25 µm
- 25% of samples have the roughness of 92 µm or less
- 75% of samples have the roughness of 289 µm or less
Samples whose material is PLA:
- The highest roughness is 321 µm
- The lowest roughness is 21 µm
- 25% of samples have the roughness of 88 µm or less
- 50% of samples have the roughness of 145 µm or less
- 75% of samples have the roughness of 192 µm or less
5 INFERENTIAL STATISTICS
Use an appropriate linear regression model to evaluate the factors affecting the roughness of the product after printing.
First, we build the linear regression model with:
- The dependent variable: roughness
- The independent variables: layer_height, wall_thickness, infill_density, infill_pattern, nozzle_temperature, bed_temperature, print_speed, material, and fan_speed
The model is displayed as follows:
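A sketch of how such a model is fitted with lm(). Since the report's actual output is not reproduced here, the snippet fits the same formula shape on synthetic stand-in data: the column names mirror the report's, but the values are made up.

```r
set.seed(4)
toy <- data.frame(layer_height = runif(50, 0.02, 0.2),
                  print_speed  = runif(50, 40, 120))
toy$roughness <- 100 + 800 * toy$layer_height +
                 0.5 * toy$print_speed + rnorm(50, sd = 5)

# The report's full model uses all nine predictors from data2, e.g.:
# model <- lm(roughness ~ layer_height + wall_thickness + infill_density +
#             infill_pattern + nozzle_temperature + bed_temperature +
#             print_speed + material + fan_speed, data = data2)

# Reduced version on the toy columns only:
model <- lm(roughness ~ layer_height + print_speed, data = toy)
summary(model)   # estimated coefficients, standard errors, p-values, R-squared
```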