Intelligent Agents for Data Mining and Information Retrieval [Electronic resources] نسخه متنی

اینجــــا یک کتابخانه دیجیتالی است

با بیش از 100000 منبع الکترونیکی رایگان به زبان فارسی ، عربی و انگلیسی

Intelligent Agents for Data Mining and Information Retrieval [Electronic resources] - نسخه متنی

Masoud Mohammadian

| نمايش فراداده ، افزودن یک نقد و بررسی
افزودن به کتابخانه شخصی
ارسال به دوستان
جستجو در متن کتاب
بیشتر
تنظیمات قلم

فونت

اندازه قلم

+ - پیش فرض

حالت نمایش

روز نیمروز شب
جستجو در لغت نامه
بیشتر
توضیحات
افزودن یادداشت جدید


"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">









  • PATTERN MATCH ANALYSIS

    While exploiting and researching the Market Basket Data Analysis System Based on Mining Associations Rules, we analyzed the pattern match included in mining association rules. We discovered some characteristics of the transaction data set and the set of candidates:

    Recording the item number and index can reduce the comparison time and enhance the efficiency of pattern match. Sort all the items of each transaction in the transaction data set alphabetically. When preprocessing the transaction data set, do the sort, record the item number of each transaction and set it as an index. Then, choose a suitable algorithm to generate the k-itemsets. We get alphabetized k-itemsets which match.



  • The itemsets in the set of candidates can be alphabetized, not only in row, but also in a column. The set of candidate k-itemsets generated by the join and prune steps, based on the set of candidate (k

    − 1)-itemsets and C1, is alphabetical. Then, through the SQL language as follows, we can get the alphabetic set of candidate k-itemsets in a row and in a column:


    INSERT INTO Ck
    SELECT TOP 100 PERCENT dbo.p.col1, dbo.p.col2,..., dbo.p.colk

    − 1, q. colk

    − 1 AS colk
    FROM dbo.lk

    − 1 p INNER JOIN dbo.lk

    − 1 q
    ON dbo.p.col1=q.col1 AND dbo.p.col2=q.col2 AND ... AND dbo.p.colk

    − 1<q.colk

    − 1
    ORDER BY P.col1, P.col2,..., P.colk

    − 1, Q.colk

    − 1


    We can deal with the pattern match by the alphabetic characteristic. For two alphabetic linear objects, we can find the most efficient lookup algorithm to realize match easily. Then, we should look for a suitable algorithm to transform the alphabetic pre-match object into an alphabetic linear object. The simplest transformation method is to set the item number as weight and add up k-itemsets with weight as the object of match.



  • The item number of the longest itemset in a transaction data set is far less than the total item number |I|, and the transactions having longer itemsets take up only a small part of the total transactions. When researching the market data set, which has 15,169 pieces of merchandise divided into 228 kinds of ware, we processed the mining based on 228 items, that is |I|=228. We discovered that the item number of the longest transaction in the transaction data set is 33; 0.098 percent of the transactions have an item number larger than 20; 0.62 percent have an item number larger than 15; and 3.9 percent have an item number larger than 10.



  • The matching data object is larger, and the matching success ratio is lower. When the k-itemsets number of each transaction in the transaction data set is larger, or when the transaction item number is larger, the itemset number of the set of candidates is larger. Then, the matching success ratio is lower. Owing to match being a process of lookup in nature, we can choose an algorithm which fits the match characteristic. In this chapter, that algorithm is bisearch, which has an average comparison time of log2n (n is the length of the object) and, therefore, can improve the efficiency.



  • The match proceeded between two alphabetical objefound that the k-itemsets number of each transaction in the transaction data set, when k went from 1 to some scale, was always less than, or far less than, the itemsets number of the set of candidates. But above the scale, the k-itemsets number of each transaction in the transaction data set was more than the itemsets number of candidates. The results show determine data object should be applied to match the other one between the transaction data set and the set of candidates. But what's the determining principlecy.



  • Let the average transaction number of transaction itemsets in transaction data set DB be h. Then, the average k-itemsets number in each transaction is Ckh. And, let there be m k-itemsets in candidate k-itemsets set Ck. Using the bisearch method, if we take each candidate in candidate k-itemsets set Ck to match the Ckh-itemsets of each transaction in the transaction data set, then the total comparison time is mlog2 Ckh; conversely, if we take the Ckh k-itemsets of each transaction in the transaction data set to match each candidate in candidate k-itemsets set Ck, the total compare time is Ckhlog2m. Which one measure we should take is decided by the big and small of mlog2Ckh and Ckhlog2m.

    Then compare the big and small of mlog2Ckh and Ckhlog2m.


    When x>e(const), then ln>1, h

    ′ (x)<0, function h(x) is a monotone decrease function, and then

    is tenable. That is, when e<x<y, xlog2y<ylog2x.

    To the pattern match of mining for association rules, there is Ckh>h>e and m is a positive integer.

    When m>e, if Ckh<m, then Ckhlog2m<mlog2C, and we choose to take the Ckh k-itemsets of each transaction in the transaction data set to match each candidate in candidate k-itemsets set Ck, which has a higher efficiency. On the contrary, if Ckh>m, then Ckhlog2m>mlog2Ckh, and we choose to take each candidate in candidate k-itemsets set Ck to match the Ckh k-itemsets of each transaction in the transaction data set, which has the higher efficiency.

    When m=2, then Ckh>2log2Ckh; and when m=1, the question is changed into the match between one single data and an array of data. Therefore, when m<e, take each itemset in candidate k-itemsets set Ck to match the Ckh k-itemsets of each transaction in the transaction data set.

    From the above, when Ckh<m, take k-itemsets of each transaction in the transaction data set to match the candidates set. Contrarily, when Ckh>m, take the candidate k-itemsets set to match k-itemsets of each transaction in the transaction data set.

    We can summarize the flow of the pattern match being fit for association rules as follows, through the preview analysis:

    If k-itemsets of each transaction in the transaction data set match the candidate k-itemsets set, we should dispose of each k-itemsets of the candidate k-itemsets set by adding up their weights and putting them into a linear array to wait for match. Then, deal with current k-itemsets of the current transaction in the transaction data set by adding up the weight for each match, which is a match between a single data and an array, then choose, and then dispose and match the next k-itemsets of transaction. That method can be an effective use of the alphabetic characteristic of the transaction data set to make it more efficient. We can use a similar method when taking the candidate k-itemsets set to match k-itemsets of each transaction in the transaction data set.

    / 171