Kết quả thực nghiệm và đánh giá

CHƯƠNG 3 GIẢI THUẬT VÀ THỰC NGHIỆM

3.5. Kết quả thực nghiệm và đánh giá

Ứng với mỗi kịch bản thực nghiệm Nhóm 1 và Nhóm 2, kết quả thực nghiệm được ghi nhận để so sánh độ hiệu quả của các giải thuật dựa trên các tiêu chí sau đây:

 Thời gian chạy (CPU runtime) của giải thuật

 Số lần gọi hàm tính khoảng cách Euclidean trong giải thuật Kết quả thực nghiệm Nhóm 1:

Lần lượt thực nghiệm trên 05 tập dữ liệu kiểm thử được trình bày trong phần 3.3, kết quả thời gian chạy và số lần gọi hàm khoảng cách của cả hai giải thuật HOTSAX và IDD được trình bày lần lượt trong các bảng từ Bảng 3.4 đến Bảng 3.8 và chuỗi con bất thường tìm được lần lượt được trình bày trong các hình từ Hình 3.2 đến Hình 3.6.

Các kết quả thực nghiệm trong Nhóm 1 cho thấy cả hai giải thuật HOTSAX và IDD đều trả về cùng chuỗi bất thường nhưng giải thuật IDD có thời gian chạy nhanh hơn đáng kể so với giải thuật HOTSAX mặc dù số lần gọi hàm khoảng cách của giải thuật IDD cao hơn một ít so với giải thuật HOTSAX trong một số trường hợp. Giải thuật IDD chạy nhanh hơn giải thuật HOTSAX là bởi vì giải thuật IDD có được ưu điểm tính toán nhanh của độ đo thông tin WDen, trong khi đó giải thuật HOTSAX phải mất chi phí cho việc xây dựng một mảng và một cây để tạo các Heuristic này cho hai vòng lặp trong bước tiếp theo của giải thuật.

Ngoài ra, kết quả thực nghiệm trong Nhóm 1 cũng cho thấy rằng giải thuật IDD có hiệu quả ổn định hơn giải thuật HOTSAX khi chạy trên 5 tập dữ liệu kiểm thử khác nhau. Nhìn chung, hầu hết các thực nghiệm cho thấy giải thuật HOTSAX có hiệu quả cao nhất khi a = 3 trong khi đó giải thuật IDD có hiệu quả cao nhất khi a = 21. Do đó, trong các thực nghiệm Nhóm 2 sau đây, tác giả thiết lập cố định a = 3 cho giải thuật HOTSAX và a = 21 cho giải thuật IDD.

Bảng 3.4: Cố định giá trị w = 5, thay đổi giá trị a trên tập dữ liệu VIDEO

Tập dữ liệu VIDEO, chiều dài discord n = 200, Kích thước từ w = 5 Alphabet size (a) Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

3 169 109 799.163 1.098.983

6 148 74 363.384 856.316

9 141 59 322.863 594.373

12 155 52 308.173 531.289

15 154 48 445.538 493.074

18 158 45 529.700 465.488

21 179 43 716.054 472.678

Hình 3.2: Chuỗi con bất thường của tập dữ liệu VIDEO

Bảng 3.5: Cố định giá trị w=5, thay đổi giá trị a trên tập dữ liệu ECG Tập dữ liệu ECG, chiều dài discord n = 256, kích thước từ w = 5 Alphabet size (a) Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

3 2.625 1.128 2.115.030 4.173.756

6 2.165 806 2.035.393 3.215.225

9 2.315 691 1.807.239 3.329.775

12 2.126 554 2.367.130 2.845.442

15 2.186 552 2.458.820 2.472.502

18 2.215 495 3.359.932 2.703.755

21 2.221 459 3.687.119 2.713.265

Hình 3.3: Chuỗi con bất thường của tập dữ liệu ECG

Bảng 3.6: Cố định giá trị w=5, thay đổi giá trị a trên tập dữ liệu POWER

Tập dữ liệu POWER, chiều dài discord n = 750, kích thước từ w = 5 Alphabet size (a)

Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

3 2.343 1.508 7.338.545 10.203.248

6 2.312 1.581 4.890.053 7.566.715

9 2.117 938 2.700.775 7.363.209

12 2.279 1.001 2.119.408 9.615.270

15 2.061 901 1.566.813 7.770.451

18 2.005 867 1.778.374 7.911.583

21 2.011 725 1.483.770 6.676.988

Hình 3.4: Chuỗi con bất thường của tập dữ liệu POWER

Bảng 3.7: Cố định giá trị w=5. thay đổi giá trị a trên tập dữ liệu PATIENT Tập dữ liệu PATIENT, chiều dài discord n = 150, kích thước từ w = 5 Alphabet size

(a)

Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

3 100 69 513,458 571,752

6 95 50 423,480 506,449

9 113 42 340,365 447,809

12 101 37 491,117 407,768

15 107 37 626,552 469,552

18 116 34 784,931 469,123

21 116 33 990,942 451,602

Hình 3.5: Chuỗi con bất thường của tập dữ liệu PATIENT

Bảng 3.8: Cố định giá trị w=10, thay đổi giá trị a trên tập dữ liệu SpaceShuttle Tập dữ liệu SpaceShuttle, chiều dài discord n = 100, kích thước từ w = 10 Alphabet size (a) Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

3 162 158 403,986 424,596

6 148 110 227,257 153,840

9 154 92 429,210 238,469

12 155 73 564,910 134,071

15 164 72 875,894 227,979

18 165 63 983,927 168,667

21 183 64 1,264,173 165,303

Hình 3.6: Chuỗi con bất thường của tập dữ liệu SpaceShuttle

Kết quả thực nghiệm Nhóm 2:

Tương tự phương pháp thực nghiệm ở Nhóm 1, lần lượt thực nghiệm trên 05 tập dữ liệu kiểm thử được trình bày trong phần 3.3, kết quả thời gian chạy và số lần gọi hàm khoảng cách của cả hai giải thuật HOTSAX và IDD được trình bày lần lượt trong các bảng từ Bảng 3.9 đến Bảng 3.13.

Bảng 3.9: Cố định giá trị a, thay đổi giá trị w trên tập dữ liệu VIDEO Tập dữ liệu VIDEO, chiều dài discord n = 200

Word size

(w) PosStart PosEnd Running Time (s) Distance function calls HOTSAX IDD HOTSAX IDD

5 2.093 2.293 169 43 799.163 472.678

10 2.093 2.293 129 92 310.425 522.219

15 2.093 2.293 139 110 375.217 662.279

20 2.093 2.293 173 118 505.380 756.721

25 2.093 2.293 226 143 503.848 783.678

Bảng 3.10: Cố định giá trị a, thay đổi giá trị w trên tập dữ liệu ECG Tập dữ liệu ECG, chiều dài discord n = 256

Word size

(w) PosStart PosEnd

Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

5 9,722 9,978 2,625 459 2,115,030 2,713,265

10 9,722 9,978 2,242 837 2,338,959 2,385,067

15 9,722 9,978 2,174 799 3,210,908 2,385,067

0 9,722 9,978 2,290 1,409 4,367,491 2,580,544

25 9,722 9,978 2,381 1,767 5,303,408 2,604,745

Bảng 3.11: Cố định giá trị a, thay đổi giá trị w trên tập dữ liệu POWER Tập dữ liệu POWER, chiều dài discord n = 750

Word size

(w) PosStart PosEnd Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

5 6,101 6,851 2,343 725 7,338,545 6,676,988

10 6,101 6,851 2,597 1,251 2,537,857 8,978,958

15 6,101 6,851 2,131 1,409 1,653,447 6,474,467

20 6,101 6,851 2,247 1,742 1,666,488 5,577,241

25 6,101 6,851 2,299 2,134 3,075,539 5,359,777

Bảng 3.12: Cố định giá trị a, thay đổi giá trị w trên tập dữ liệu PATIENT Tập dữ liệu PATIENT, chiều dài discord n = 150

Word size

(w) PosStart PosEnd Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

5 3,315 3,465 100 37 513,458 451,602

10 3,315 3,465 93 51 305,235 267,888

15 3,315 3,465 122 72 618,564 250,389

20 3,315 3,465 133 94 851,674 282,096

25 3,315 3,465 118 124 1,166,504 314,168

Bảng 3.13: Cố định giá trị a, thay đổi giá trị w trên tập dữ liệu SpaceShuttle Tập dữ liệu SpaceShuttle, chiều dài discord n = 100

Word size

(w) PosStart PosEnd Running Time (s) Distance function calls

HOTSAX IDD HOTSAX IDD

5 4,246 4,346 166 48 708,873 223,817

10 4,246 4,346 162 64 403,986 165,303

15 4,246 4,346 153 105 488,066 181,313

20 4,246 4,346 153 106 422,168 180,911

25 4,246 4,346 164 131 533,487 257,547

Kết quả thực nghiệm trong Nhóm 2 cho thấy giải thuật IDD có thời gian chạy ít hơn đáng kể so với giải thuật HOTSAX, đặc biệt khi tham số w có giá trị nhỏ. Kết quả thực nghiệm trên 5 tập dữ liệu cho thấy khi tham số w có giá trị càng tăng thì thời gian chạy của giải thuật đề xuất IDD cũng tăng theo. Trong khi đó, giải thuật HOTSAX có thời gian chạy không tăng tuyến tính theo giá trị tăng của tham số w. Điều này được giải thích như sau: các dòng lệnh từ dòng 1 đến dòng 6 của giải thuật IDD được trình bày trong Bảng 3.1 cho thấy chi phí thời gian chạy của giải thuật IDD phụ thuộc vào giá trị tham số w, giá trị tham số w càng nhỏ thì thời gian chạy của giải thuật càng nhỏ.

Ngoài ra, khi xét về chi phí bộ nhớ thì giải thuật đề xuất IDD có hiệu quả hơn so với giải thuật HOTSAX. Như được trình bày trong Bảng 3.1, giải thuật đề xuất IDD chỉ sử dụng 04 mảng để lưu trữ các giá trị E(Ci), W(Ci), DenCj(xi) và WDen(xi) để tạo các Heuristic cho hai vòng lặp trong bước tiếp theo của giải thuật.

Trong 04 mảng trên, 03 mảng có kích thước w và 01 mảng còn lại có kích thước (m - n + 1), là số lượng chuỗi con dự tuyển. Vì tham số w thường có giá trị rất nhỏ so với m nên hầu như toàn bộ không gian bộ nhớ được sử dụng để chạy thuật toán là (m - n + 1). Trong khi đó, giải thuật HOTSAX có chi phí bộ nhớ khá lớn vì sử dụng một mảng và một cây để xét các chuỗi con trong hai vòng lặp lồng nhau. Chi phí bộ nhớ cho giải thuật HOTSAX là (m - n + 1)*w. Do đó, giải thuật đề xuất IDD giảm được chi phí bộ nhớ so với giải thuật HOTSAX là w lần.

KẾT LUẬN VÀ HƯỚNG PHÁT TRIỂN

 Kết quả của luận văn

Luận văn đã khảo sát về khai phá dữ liệu chuỗi thời gian, nghiên cứu và đề xuất một phương pháp giải quyết bài toán phát hiện chuỗi bất thường trên dữ liệu chuỗi thời gian, bao gồm:

 Đề xuất giải thuật phát hiện chuỗi bất thường

 Đề xuất một phương pháp đánh giá hiệu quả của giải thuật phát hiện chuỗi bất thường, kết hợp đánh giá dựa trên số lần gọi hàm tính khoảng cách và thời gian chạy của giải thuật.

Luận văn tiến hành hiện thực giải thuật phát hiện chuỗi bất thường đề xuất IDD, bao gồm các bước sau:

 Thu giảm số chiều theo phép biến đổi PAA

 Rời rạc hóa dữ liệu theo phép biến đổi SAX

 Giải thuật phát hiện chuỗi bất thường dựa vào giải thuật HOTSAX và độ đo thông tin WDen.

Nhờ vào cách tiếp cận sử dụng độ đo thông tin WDen để xây dựng các Heuristic thay vì xây dựng mảng và cây như HOTSAX nên giải thuật đề xuất IDD có thời gian chạy nhanh hơn đáng kể so với giải thuật HOTSAX. Bên cạnh đó, giải thuật đề xuất IDD cũng có hiệu quả chạy ổn định hơn đáng kể so với HOTSAX khi thực nghiệm trên nhiều tập dữ liệu khác nhau. Ngoài ra, giải thuật đề xuất IDD còn cải thiện chi phí bộ nhớ đáng kể so với giải thuật HOTSAX vì không sử dụng cấu trúc chỉ mục cây để thiết lập Heuristic cho hai vòng lặp lồng nhau.

Luận văn đã cài đặt giải thuật đề xuất IDD và giải thuật HOTSAX để tiến hành các kịch bản thực nghiện, so sánh và đánh giá kết quả thực nghiệm. Các kết quả thực nghiệm trên 05 tập dữ liệu kiểm thử đã một lần nữa khẳng định hiệu quả về chi phí thời gian chạy, chi phí bộ nhớ và độ ổn định của giải thuật đề xuất IDD so với giải thuật HOTSAX.

 Ứng dụng tiềm năng

 Ứng dụng trong đo điện tim cho người bệnh đang được điều trị tích cực, cần theo dõi liên tục và được điều trị dài ngày. Dữ liệu kết quả đo điện tim được đưa vào ứng dụng cài đặt giải thuật phát hiện chuỗi bất thường đề xuất của luận văn, nếu có giá trị bất thường trong dữ liệu điện tim thì ứng dụng sẽ cảnh báo, giúp bác sĩ kịp thời theo dõi bất thường về số đo điện tim cho bệnh nhân.

 Ứng dụng trong hệ thống cảnh báo mực nước của các hồ chứa nước:

mực nước được đo liên tục và dữ liệu kết quả đo được đưa vào ứng dụng cài đặt giải thuật phát hiện chuỗi bất thường đề xuất của luận văn.

Nếu dữ liệu đo mực nước có giá trị bất thường thì ứng dụng sẽ cảnh báo đến các chuyên gia, giúp chuyên gia kịp thời kiểm tra, xử lý.

 Hướng phát triển

 Khảo sát và phân tích đặc điểm của các tập dữ liệu thực nghiệm và tiến hành thực nghiệm trên nhiều tập dữ liệu có các đặc tính khác nhau để có thể rút được nhiều kết luận về giải thuật đề xuất.

 Nghiên cứu nhiều độ đo thông tin khác và cải tiến giải thuật đề xuất IDD để có thể giảm số lần gọi hàm tính khoảng cách.

DANH MỤC TÀI LIỆU THAM KHẢO

[1] Q. Yang and X. Wu, (2006), “10 Challenging Problems in Data Mining Research”, International Journal of Information Technology and Decision Making, vol. 5, pp. 597-604.

[2] D. Cheboli, (2010), “Anomaly Detection of Time Series”, Master Thesis.

[3] F. Chuah, M.C. Fu., “ECG anomaly detection via time series annalysis”, Lecturer notes in computer science, pp. 123-135.

[4] S. Akhavan and G.Calva, (1998), “Automatic anomaly detection in ECG signal by fuzzy decision making”, In Proceedings of 6th International Conference on Fuzzy Theory and Technology: Association for Intelligent Machinery, pp. 23-28.

[5] Sheng Zhang, Amit Chakrabarti, James Ford, and Fillia Makedon, (2006), Attack detection in time series for recommender systems, in KDD ’06:

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, New York, 2006, pp. 809-814.

[6] David L. Iverson, (2004), “Inductive system health monitoring”, International Conference on Artificial Intelligence.

[7] Manue Bicego and Vittorio Murino, (2004), “Investigating hidden markov models’ capabilities in 2d shape classification”, IEEE Trans. Pattern Anal.

March. Intell., Vol. 26, No. 2, pp. 281-286.

[8] Zheng Liu, J.X. Yu, Lei Chen, and Di Wu, (2008), “Detection of shape anomalies: A probabilistic approach using hidden markov models”, IEEE 24th International Conference on Data Engineering, pp. 1325-1327.

[9] Li Wei, E. Keogh, and Xiaopeng Xi, (2006), “Saxually explicit images:

Finding unusual shapes”, Data Mining, 2006. ICDM ’06. Sixth International Conference on, pp. 711-720.

[10] P. Protopapas, J. M Giammarco, L. Faccioli, M. F. Struble, R. Dave, and C.

Alcock, (2006), “Finding outlier light-curves in catalogs of periodic variable stars”, MON.NOT.ROY.ASTRON.SOC, pp. 369-677.

[11] U.Rebbapragada, P. Protopapas, C. E. Brodley, and C. Alcock, (2008),

“Finding anomalous periodic time series”, the Machine Learning Journal.

[12] Haibin Cheng, Pang-Ning Tan, Christoper Potter, and Steven Klooster, (2009),

“Detection and characterization of anomalies in multivariate time series”, In Proceedings of the ninth SIAM International Conference on Data Mining.

[13] Z. Y. He, S. C. Deng and X. F. Xu, (2005), “An optimization model for outlier detection in categorical data”, Proc. of International Conference on Advances in Intelligent Computing, Hefei, China, pp. 400-409.

[14] Keogh, E., Lin, J. and Fu, (2005), “A HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence”, In: Proc. of 5th IEEE Int. Conf. on Data Mining (ICDM), pp. 226-233.

[15] Fu, O. Leung, E. Keogh, and J. Lin, (2006), “Finding time series discords based on Haar transform”, In Proc. of the 2nd International Conference on Advanced Data Mining and Applications, pp. 31–41.

[16] Wei L, Keogh E, Xi X (2006), “SAX ually explicit images: finding unusual shapes”, In: Proceedings of the 2006 IEEE international conference on data mining, Hong Kong, pp. 18–22.

[17] Anscombe, F. J. and Guttman, I. (1960), “Rejection of outliers”, Technometrics, Vol. 2, No. 2, pp. 123–147.

[18] Z. Y. He, S. C. Deng and X. F. Xu, (2006), “A fast greedy algorithm for outlier mining”, Proc. of the 10th Pacific Asia Conference on Knowledge and Data Discovery, pp. 567-576.

[19] Bu, Y., Leung, T.W., Fu, A., Keogh, E., Pei, J., Meshkin, (April 2007), WAT:

finding Top-K discords in time series database, In: Proceedings of the 2007 SIAM International Conference on Data Mining (SDM 2007), Minneapolis, MN, USA, 26–28.

[20] X. W. Zhao, J. Y. Liang and F. Y. Cao, (2014), “A simple and effective outlier detection algorithm for categorical data”, International Journal of Machine Learning Cybernet, vol.5, no.3, pp. 469-477.

[21] G. Li, O. Braysy, L. Jiang, Z. Wu, and Y. Wang, (2013), “Finding time series discord based on bit representation clustering”, Knowledge-Based Systems, vol. 54, pp. 243–254.

[22] Nguyễn Nam, (2014), Giá vàng hôm nay: Chứng khoán và đồng đô la làm giảm giá vàng, http://vietq.vn/gia-vang-hom-nay-chung-khoan-va-dong-do-la- lam-giam-gia-vang-d36727.html. 20/09/2017.

[23] E. Keogh and T. Folias. In The ucr time series data mining archive.

[24] A. L. Goldberger et al, (2000), “Physiobank physiotoolkit and physionet:

Components of a new research resource for complex physiologic signals”.

[25] Deepthi Cheboli Varun Chandola and Vipin Kumar, “Detecting anomalies in a time series database”, Technical report, 09-004.

[26] Michail Vlachos, Marios Hadjieleftheriou, Dimitrios Gunopulos, and Eamonn Keogh. Indexing multi-dimensional time series with support for multiple distance measures. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 216- 225, New York, NY, USD, 2003. ACM.

[27] B. Abraham and A. Chuang, (1989), “Outlier detection and time series modeling”, Technometrics, vol. 31, no. 2, pp. 241-248.

[28] B. Abraham and G. E. P. Box, (1979), “Bayesian analysis of some outlier problems in time series”, Biometrika, vol. 66, vo. 2, pp. 229-236.

[29] A. J. Fox, (1972), “Outliers in time series”, Journal of the Royal Statistical Society, Series B (Methodological), vol. 34, no. 3, pp. 350-363.

[30] L. Rabiner and B. Juang, (Jan 1986), “An introduction to hidden markov models”, ASSP magazine, IEEE, vol. 3, no. 1, pp. 4-16.

[31] E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra, (2001),

“Demensionality reduction for fast similarity search in large time series databases”, Journal of Knowledge and Information System, vol. 3, no. 3, pp.

263-286.

[32] Zheng Liu, J.X. Yu, Lei Chen, and Di Wu, (2008), “Detection of shape anomalies: A probabilistic approach using hidden markov models”, IEEE 24th International Conference on Data Engineering, pp. 1325-1327.

[33] Keogh. E. (2005), www.cs.ucr.edu/~eamonn/discords/.

PHỤ LỤC 1 – BÀI BÁO KHOA HỌC ĐÃ CÔNG BỐ

An Information theory-based Approach for Time Series Discord Discovery

Thuc-Doan Do1,2, Hong-Van T. Hoang1,2, Cam Huynh3, Thuy-Van T. Duong1,2*

1 Center for Applied Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

2 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

3Faculty of Information Technology, Post and Telecommunication Institute of Technology, Vietnam { dothucdoan; hoangthihongvan; duongthithuyvan }@tdt.edu.vn

*Corresponding author: duongthithuyvan@tdt.edu.vn huynhcam@gmail.com

Received February 2018; accepted April 2018

ABSTRACT. Time series discord is a subsequence which is the most different from the remaining subsequences of a time series. In this paper, we propose a novel approach based on information theory to discover time series discord. Many previous papers rely on prebuilding an index for subsequences to make heuristic for pruning away unnecessary searches. However, building such data structures has a significant effect on running performance and memory cost. In our approach, using information theory rather than data structure reduces the preprocessing time and space to make heuristic, and helps our method run faster with less space. Experimental results on some time series datasets demonstrate the effectiveness and efficiency of our proposed algorithm compared to HOTSAX while the discords discovered by two methods completely match each other.

Keywords: Discord discovery, Time series, Outlier detection, Information theory, Entropy

1. Introduction. Time series discord is defined as a subsequence of a longer time series that is maximally different to other subsequences. Time series discord discovery plays an important role and has proven to be useful for data mining such as data cleaning, data description, clustering. It captures the idea of anomalous subsequences in time series but takes only one intuitive parameter - the length of the subsequence while most anomaly detection algorithms require many parameters.

A simple way to find the discord is the brute force algorithm. It requires comparisons among O(m2) pair-wise distances, where m is the length of the time series. Therefore, it is not applicable in practice. To search for discords more efficiently, most efforts focus on building heuristics for two nested loops in the brute force. They are based on data structures to reorder the subsequences with the hope that it is possible to exit quickly from two nested loops. It takes time and space to build such data structures, some of which often have unintuitive parameters (e.g., word length and alphabet size) to tune. There is a trade- off between effectiveness and efficiency in this case. That is the reason why evaluating the performance of the algorithm based on only the number of calls to distance function is not exact. In this paper, we propose a novel approach to discover time series discord, which is information theory-based to reduce the time and space compared to the above-mentioned methods. The main contributions of our work are the following:

 Propose a new solution that reduces the preprocessing time and space so that it can

work efficiently in the whole process of discord discovery.

 Do the experiments on the various data sets to show the efficiency of our algorithm.

 Compare the results to HOT SAX and discuss obtained results.

The rest of the paper is organized as follows. In Section 2, related works on outlier detection and time series discord discovery are presented. Then, we give some essential definitions, present generic framework to solve discord discovery and propose our solution in Section 3. Section 4 presents the experiments, results obtained and discussion. Finally, Section 5 draws the conclusion and future works.

2. Related Works.

2.1. Outlier Detection. Outlier is defined in [1] as an observation that deviates so much from other observations as to arouse suspicion that it is generated from a different mechanism. In [2], approaches for outlier detection consist of seven groups: classification- based, clustering-based, nearest neighbor-based, statistical, information theory-based, spectral decomposition-based, visualization-based.

Among these approaches, information theory-based approach has a solid foundation as it is based on mathematics and yields promising results. This method uses information theoretic measures, such as entropy and relative uncertainty to compute the disorder of a dataset after removing outliers as it assumes that outliers cause irregularities in the data and affect information consistency. In some papers, the problem of outlier detection is considered as an optimization problem and solved by a local search heuristic-based algorithm [3] and a greedy algorithm [4] to minimize the objective function. In [5], outlier detection is also considered as an optimization problem, in which the efficiency is improved by combining the correlation among attributes and the importance of each attribute. In [6], the authors proposed an algorithm based on entropy using a weighted density definition for categorical data and showed the superiority of that algorithm compared with the existing outlier detection methods.

2.2. Time Series Discord Discovery. When finding outliers within a time series, outliers can be points or subsequences. When it comes to the later one, the most famous problem is discord discovery, which was firstly defined in [7] and has confirmed the utility in many fields [8, 9]. While few papers find discords approximately like [10] with the idea that subsequences in a smaller sized cluster have the higher probability to be a discord, most studies find exact discords by building heuristics for two nested loops in the brute force.

The authors in [7] suggested an algorithm called HOT SAX with a heuristic technique for pruning quickly the data space and focusing only on the potential discords. They use two data structures: an array of SAX word and a trie to support in approximating the magic outer and inner loops. In [11], WAT algorithm is based on Haar wavelet and augmented trie to mine the top-K discords from a time series database. It provides better pruning power and can dynamically determine the word size. Vector quantization, a time series discretization technique, is used to reduce dimensions and discretize time series data. In [12], the authors create two data structures: a table of VQ representations for subsequences of the input data and an augmented trie to lookup occurrences of VQ representations.

3. Problem and Solution.

3.1. Problem Definition.

Definition 1. Time Series Discord: Given a time series T, the subsequence D of length n beginning at position l is said to be the discord of T if D has the largest distance to its

Dựa vào cửa sổ trượt (Window based)

Hướng giải quyết vấn đề