Organization. This paper is organized as follows: Section 2 discusses related work. Section 3 presents background and definitions of data quality concepts. Section 4 highlights data cleaning in medical applications. Section 5 presents the proposed ICCFD_Miner, MICCFD_Miner, and T_Repair techniques. Section 6 discusses the experimental study and the results obtained on different medical datasets. Finally, Section 7 concludes the work and highlights future trends.
Related work. Unfortunately, despite the urgent need for precise and dependable techniques for improving data quality and cleaning data, no definitive solution to these problems exists to date. There has been little discussion and analysis of enhancing data consistency; most recent work focuses on record matching and duplicate detection [2]. Database and data quality researchers have studied a variety of integrity constraints based on Functional Dependencies (FDs) [5,12,20,35]. In Ref. [35], the authors propose the FD_Mine algorithm, which discovers functional dependencies from a given relation. A survey and comprehensive comparison of seven algorithms for discovering functional dependencies, namely TANE, FUN, FD_Mine, DFD, Dep-Miner, FastFDs, and FDEP, is given in Ref. [27]. Nevertheless, traditional FDs were developed mainly for schema design and are often unable to detect semantic errors in data values.

Other researchers have focused on extensions of FDs, proposing what are called Conditional Functional Dependencies (CFDs) and Conditional Inclusion Dependencies (CINDs) to capture errors in data. Algorithms proposed for discovering CFD rules from a relation include CFDMiner, which discovers constant conditional functional dependencies; CTANE, which extends TANE to discover general CFDs; and FastCFD, which discovers general CFDs by employing a depth-first search strategy instead of the level-wise approach used in CTANE [6]. A minimal sketch of FD and CFD violation checking appears at the end of this section.

Several data quality techniques have been proposed to clean messy tuples in databases [9], as researchers aim to find critical information missing from databases. In Ref. [9], the authors propose three models that specify the relative information completeness of databases from which both tuples and values may be missing. Statistical inference approaches, which infer missing information and correct errors automatically, are studied in Ref. [24]. These approaches tackle missing values in order to enhance the quality of data.

On the technological side, several open-source tools have been developed for handling messy data. OpenRefine and Data Wrangler are two open-source tools for detecting and cleaning missing data, as detailed in Ref. [18]. Moreover, there is a variety of data transformation methods, such as commercial ETL (Extract, Transform, and Load) tools [31]. Extraction methods focus on extracting data from homogeneous and/or heterogeneous data sources. Transformation methods store the data in a format or structure suitable for querying and analysis. Loading methods load the data into a single repository, such as a data warehouse or another unified data source, depending on the requirements of the organization. These tools support data cleaning in the presence of changes to the structure, representation, or content of the data.

The use of editing rules in combination with master data is discussed in Ref. [8]. Such rules can find certain fixes by updating input tuples with values from master data. In contrast to static constraints, editing rules have dynamic semantics and are defined relative to master data: given an input tuple t that matches a pattern, an editing rule tells us which attributes of t should be updated and which values from the master data should be assigned to them. This approach requires defining editing rules manually for both relations, i.e., the master relation and the input relation, which is expensive and time-consuming.
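To illustrate the mechanics of such rules, the following is a minimal sketch (in Python) of applying a single editing rule against a master relation; the attribute names, the rule, and the records are hypothetical illustrations, not taken from Ref. [8].

    # Minimal sketch of an editing rule applied with master data.
    # The schema (zip, city, state), the rule, and all records are
    # hypothetical assumptions for illustration only.

    # Master relation: assumed correct and complete, keyed by "zip".
    MASTER = {
        "10001": {"city": "New York", "state": "NY"},
        "94105": {"city": "San Francisco", "state": "CA"},
    }

    def apply_editing_rule(t):
        """If the input tuple matches a master record on "zip", the rule
        says its (city, state) attributes must take the master values."""
        fixed = dict(t)
        master_row = MASTER.get(t.get("zip"))
        if master_row is not None:      # pattern: tuple matches on zip
            fixed.update(master_row)    # certain fix drawn from master data
        return fixed

    # A tuple with an inconsistent city receives a certain fix:
    print(apply_editing_rule({"zip": "10001", "city": "Newark", "state": "NJ"}))
    # {'zip': '10001', 'city': 'New York', 'state': 'NY'}

Note that the rule only fires when the pattern matches; tuples whose zip does not appear in the master relation are returned unchanged, which is why such approaches depend on the coverage of the master data.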
Repair techniques use heuristic solutions based on minimizing the cost of updates, and they do not always produce a deterministic fix. Editing rules, in turn, require users to examine every tuple, which is expensive.
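To make the dependency-based constraints surveyed above concrete, here is a minimal sketch of detecting violations of a traditional FD and of a constant CFD over a toy relation; the schema, the dependencies, and the rows are illustrative assumptions, not the paper's ICCFD_Miner technique.

    # Minimal sketch of FD and constant-CFD violation detection.
    # The relation, the FD country -> capital, and the constant CFD
    # ([country = 'UK'] -> [capital = 'London']) are assumed examples.
    from collections import defaultdict

    ROWS = [
        {"country": "UK", "zip": "EH1",  "capital": "London"},
        {"country": "UK", "zip": "SW1",  "capital": "Edinburgh"},
        {"country": "NL", "zip": "1012", "capital": "Amsterdam"},
    ]

    def fd_violations(rows, lhs, rhs):
        """FD lhs -> rhs: tuples that agree on lhs must agree on rhs."""
        groups = defaultdict(set)
        for r in rows:
            groups[tuple(r[a] for a in lhs)].add(r[rhs])
        return {k: v for k, v in groups.items() if len(v) > 1}

    def constant_cfd_violations(rows, pattern, rhs_attr, rhs_value):
        """Constant CFD: every tuple matching the constant left-hand-side
        pattern must carry the constant right-hand-side value."""
        return [r for r in rows
                if all(r[a] == v for a, v in pattern.items())
                and r[rhs_attr] != rhs_value]

    print(fd_violations(ROWS, ["country"], "capital"))
    print(constant_cfd_violations(ROWS, {"country": "UK"}, "capital", "London"))

Both checks flag the second row, which agrees with the first on country but not on capital; discovery algorithms such as CFDMiner automate the search for pattern tuples of this kind over the data.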