Skip to main content

Currently Skimming:

Data Mining on Large Graphs
Pages 265-286

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.
Select key terms on the right to highlight them within pages of the chapter.


From page 265...
... , ANF's highly-accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drill-dowr~ tasks and to repeatedly answer questions about changing data sources.
From page 266...
... There ale many more examples of graphs which contain interesting information for data mining purposes. For example, the telephone calling records Tom a long distance carrier can be viewed ~ a graph, and by mining the graph we may help identify Fraudulent behaviour or marketing opportunities.
From page 267...
... Instead, we produce a data mining tool that is useful given any amount of RAM. These two goals give rise to the following list of properties that our tool must satisfy when analyzing a graph with n nodes and m edges: Error guarantees: estimates must be accurate at all distances (not just in the limit)
From page 268...
... .:/ . ,0 ~ ~ ,~ Figure 1: ANF algorithms scale but not the others Sequential scans of the edge file: avoid random accesses to the graph.
From page 269...
... : u'v ~ V) Diameter of the graph Starting set for the neighbourhood Concluding set for the neighbourhood Number of extra approximation bits Number of parallel approximations Def.
From page 270...
... Given that we are interested only in very large graphs and graphs with directed edges, this does not solve the problems related to large edge files. Instead, we will need to find a new computation strategy which avoids random access to disk.
From page 271...
... 3.1 Basic ANF Algorithm (ANF-O) A graph traversal effectively accesses the edge file in random order.
From page 272...
... Example. Figure 3 shows the bitmasks and approximations for a simple example of our most basic ANF algorithm.
From page 274...
... In both cases, we can further improve ANF by reducing the number of bits of data that must be stored. First, observe that, as ANF runs, most of the 274 DYNAMIC SOCIAL NETWORK MODELING AND ANALYSIS
From page 275...
... /.77351 est += IN+~, h, C`) where b is the average position of the least zero bits in the k bitmasks output N+ (h, S
From page 276...
... approximation scheme —~ sat— —_ The R] approbation algorithm is based on the approximate counting scheme proposed in [4]
From page 277...
... We use three real data sets: Router: Undirected Internet routers data from ISI [19] , including scans done by Lucent Bell Laboratories t124.
From page 278...
... To measure the sensitivity we averaged the RMS error over 10 trials for different values of r and the different data sets. These results appear 278 DYNAMIC SOCIALNT:TWORKMOD~LINGANDx4NALYSIS
From page 279...
... We then increase s and t proportionately to generate larger DYNAMIC SOCIAL NETWORK MODELING ED ^^YSIS 279
From page 280...
... . · · ~ Cornell cycle grid urutorm 80-20 core router Data Sot 20 _ 15 .
From page 281...
... ANF Speed-up Uniform 92 0.34 270x Cora 6 1.4 4x 8~20 680 0.9 756x Router 1,200 1.7 70ax 4.2.4 Speed and Scalability Table 3 reports wall-clock running times on an otherwise unloaded Pentium II-450 machine for both the exact computation (Breadth-First search) and ANF with k = 64 parallel approximations.
From page 282...
... Delta importance Figure 9: ANF finds the best starting move for ~ 5 Data mining with our ANF Too! With our highly-accurate and efficient approximation tool, ANF, it is now possible to answer some of the prototypical graph mining questions that we posed in the introduction.
From page 283...
... We then cluster these graphs by computing the hop exponents and forming clusters that have similar hop exponents (less than 0.1 difference)
From page 284...
... This type of study was infeasible before our ANF tool as an exact computation would have required over a year of computation time. 6 Conclusions In this paper we presented 10 interesting data mining questions on graph data, proposed an efficient and accurate approximation algorithm that gives us the tool, ANF, we needed to answer these questions, and presented results for three of these questions on real-world data.
From page 285...
... points. Can be parallelized: Our ANF algorithm may be parallelized with very few synchronization Employs sequential scans: Unlike prior approximations of the neighbourhood function, our algorithm avoids random access of the edge file and performs one sequential scan of the edge file per hop.
From page 286...
... An apriori-based algorithm for mining frequent substructures from graph data. In PDKK, pages 13-23, 2000.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.
More information on Chapter Skim is available.