Web Mining Research Support System

Abstract

The application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent studies and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. In this paper we define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude the paper by listing open research issues.

Introduction

The evolution of the web has brought us enormous and ever-growing volumes of data and information. It influences virtually all areas of people's lives. In addition, with the abundant data it provides, the web is becoming an important resource for research. Furthermore, the low cost of web data makes it even more attractive to researchers.

Researchers can retrieve web data by browsing and keyword searching [58]. However, these techniques have several limitations. It is hard for researchers to retrieve data by browsing, because of the many links they would have to follow within a website. Keyword searching returns a large amount of irrelevant data. On the other hand, traditional data extraction and mining techniques cannot be applied directly to the web because of its semi-structured or even unstructured nature. Web pages are hypertext documents, containing both text and hyperlinks to other documents, as the sketch below illustrates. Furthermore, other data sources also exist, such as mailing lists, newsgroups, and forums. Thus, the design and implementation of a web mining research support system has become a challenge for those who want to use information from the web in their research.
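To make this semi-structured nature concrete, here is a minimal sketch, using only the Python standard library, of separating a hypertext document into plain text and outgoing hyperlinks. The sample HTML and URLs are my own illustrative assumptions, not drawn from any of the systems surveyed here.

from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Splits a hypertext document into its plain text and its outgoing links."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # the unstructured part: free-running text
        self.links = []       # the structured part: hyperlinks to other documents

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


# Placeholder document; a real system would fetch this with urllib.request.
SAMPLE_HTML = ('<p>Web mining surveys <a href="http://a.example/paper">papers</a>'
               ' and <a href="http://b.example/tools">tools</a>.</p>')

extractor = TextAndLinkExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.text_parts)  # ['Web mining surveys', 'papers', 'and', 'tools', '.']
print(extractor.links)       # ['http://a.example/paper', 'http://b.example/tools']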

A web mining research support system should identify web sources according to research needs, including determining the availability, relevance, and importance of web sites; it should be able to select the data to be extracted, because a website contains both relevant and irrelevant information; and it should be able to analyze the patterns in the collected data, help build models, and validate them. A schematic sketch of these three capabilities follows.
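The sketch below is my own schematic illustration of these three capabilities, not a published design; the keyword-overlap scoring, sentence selection, and frequency counting are deliberately naive stand-ins for the real components such a system would need.

from collections import Counter


def identify_sources(research_terms, pages):
    """Rank pages (url -> text) by simple keyword overlap with the research need."""
    def score(text):
        words = text.lower().split()
        return sum(words.count(t) for t in research_terms)
    return sorted(pages, key=lambda url: score(pages[url]), reverse=True)


def select_data(text, research_terms):
    """Keep only the sentences that mention at least one research term."""
    return [s for s in text.split(".") if any(t in s.lower() for t in research_terms)]


def analyze_patterns(selected_sentences):
    """A stand-in for model building: count term frequencies across the selection."""
    return Counter(w for s in selected_sentences for w in s.lower().split())


if __name__ == "__main__":
    pages = {"http://a.example": "Web mining extracts patterns. Cooking tips here.",
             "http://b.example": "Data mining and web usage mining research."}
    terms = ["mining", "web"]
    ranked = identify_sources(terms, pages)
    sentences = select_data(pages[ranked[0]], terms)
    print(ranked, analyze_patterns(sentences).most_common(3))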

A Taxonomy of Web Mining

In this section I will present a taxonomy of web mining. Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined. I will describe and categorize some of the recent work, and give some tools and techniques related to each area.

2-1. Web Content Mining

The lack of structure that permeates the information sources on the web makes automated discovery of Web-based information difficult. Traditional search engines such as Lycos, Alta Vista, WebCrawler, ALIWEB [29], MetaCrawler, and others provide some comfort to users, but do not generally provide structural information, nor do they categorize, filter, or interpret documents. A recent study provides a comprehensive and statistically thorough comparative evaluation of the most popular search engines.

In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data mining techniques to provide a higher level of organization for semi-structured data on the Web. We summarize some of these efforts below.

2-1-1. Agent-Based Approach:

Generally, agent-based Web mining systems can be placed into the following three categories.

Intelligent Search Agents: Several intelligent Web agents have been developed that search for relevant information using domain characteristics and user profiles to organize and interpret the discovered information. Agents such as Harvest [9], FAQ-Finder [19], Information Manifold [27], OCCAM [30], and ParaSite [51] rely either on pre-specified domain information about particular types of documents, or on hard-coded models of the information sources, to retrieve and interpret documents. Agents such as ShopBot [14] and ILA (Internet Learning Agent) [42] interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA learns models of various information sources and translates these into its own concept hierarchy. A toy sketch of this style of agent follows.
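As a toy illustration of this style of agent (my own sketch; the in-memory "web" and the user profile are invented, and real agents such as ShopBot learn far richer source models), the following best-first search uses a profile of domain terms to decide which pages to visit next.

# Each page maps to (text, outgoing links); stands in for live web fetches.
TOY_WEB = {
    "start": ("portal page about shopping and products", ["vendor1", "blog"]),
    "vendor1": ("product price list for cameras", ["vendor2"]),
    "vendor2": ("camera product specifications and price", []),
    "blog": ("travel diary, nothing about products", []),
}

PROFILE = {"product", "price", "camera"}  # domain terms the user cares about


def agent_search(start, max_pages=10):
    """Greedy best-first search: expand pages whose text best matches the profile."""
    frontier, visited, results = [start], set(), []
    while frontier and len(visited) < max_pages:
        # Visit the most promising frontier page first.
        frontier.sort(key=lambda p: -len(PROFILE & set(TOY_WEB[p][0].split())))
        page = frontier.pop(0)
        if page in visited:
            continue
        visited.add(page)
        text, links = TOY_WEB[page]
        score = len(PROFILE & set(text.split()))
        if score:
            results.append((score, page))
        frontier.extend(link for link in links if link not in visited)
    return sorted(results, reverse=True)


print(agent_search("start"))  # [(3, 'vendor2'), (2, 'vendor1')]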

Information Filtering/Categorization: A number of Web agents use various information retrieval techniques [17] and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them [5, 9, 34, 53, 55]. HyPursuit [53] uses semantic information embedded in link structures and document content to create cluster hierarchies of hypertext documents, and to structure an information space. BO (Bookmark Organizer) [34] combines hierarchical clustering techniques and user interaction to organize a collection of Web documents based on conceptual information. The sketch below shows the kind of hierarchical clustering such systems perform.
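The following minimal sketch, assuming scikit-learn and SciPy are available and using invented toy documents, shows agglomerative clustering of TF-IDF document vectors; the actual algorithms in HyPursuit and BO differ in their details.

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "web mining discovers patterns in web data",
    "data mining techniques for pattern discovery",
    "recipes for pasta and tomato sauce",
    "cooking pasta with fresh tomato",
]

# Represent each document as a TF-IDF vector.
X = TfidfVectorizer().fit_transform(docs).toarray()

# Agglomerative (bottom-up) clustering with cosine distance, as in a
# hierarchical organization of a document collection.
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]: mining documents vs. cooking documents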

Personalized Web Agents: This category of Web agents learns user preferences and discovers Web information sources based on these preferences, as well as those of other individuals with similar interests (using collaborative filtering). Recent examples of such agents include WebWatcher [3], PAINT [39], Syskill & Webert [41], GroupLens [47], Firefly [49], and others [4]. For instance, Syskill & Webert maintains a user profile and learns to rate Web pages of interest using a Bayesian classifier, as sketched below.
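As a minimal sketch of this idea (assuming scikit-learn; the training pages and ratings are invented, and this is not the actual Syskill & Webert implementation), a naive Bayes classifier can be trained on pages the user has rated and then used to score unseen pages.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

pages = [
    "machine learning tutorial with examples",  # user rated interesting
    "neural network research paper",            # user rated interesting
    "celebrity gossip and fashion news",        # user rated uninteresting
    "sports scores and match highlights",       # user rated uninteresting
]
ratings = [1, 1, 0, 0]  # 1 = interesting, 0 = not

# Learn word statistics for interesting vs. uninteresting pages.
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(pages), ratings)

# Rate an unseen page by its probability of being interesting.
new_page = ["a tutorial on learning with neural networks"]
prob_hot = clf.predict_proba(vec.transform(new_page))[0][1]
print(f"P(interesting) = {prob_hot:.2f}")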

2-1-2. Database Approach

Database approaches to Web mining have focused on techniques for organizing the semi-structured data on the Web into more structured collections of resources, and on using standard database querying mechanisms and data mining techniques to analyze them.

Multilevel Databases:

The main idea behind this approach is that the lowest level of the database contains semi-structured information stored in various Web repositories, such as hypertext documents. At the higher level(s), meta-data or generalizations are extracted from the lower levels and organized in structured collections, i.e. relational or object-oriented databases. For example, Han et al. use a multi-layered database where each layer is obtained via generalization and transformation operations performed on the lower layers. Kholsa et al. propose the creation and maintenance of meta-databases at each information-providing domain and the use of a global schema for the meta-database. King & Novak propose the incremental integration of a portion of the schema from each information source, rather than relying on a global heterogeneous database schema. The ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web Hypertexts, which are generalizations of the notion of database views. The sketch after this paragraph illustrates the layering idea.
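The following sketch, using the Python standard library's sqlite3 module with an invented schema and data, illustrates the layering idea: layer 0 stores raw semi-structured documents, and layer 1 stores structured meta-data generalized from them, against which standard SQL queries work.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE layer0_docs (url TEXT PRIMARY KEY, html TEXT);
    CREATE TABLE layer1_meta (url TEXT, title TEXT, word_count INTEGER);
""")

raw = [("http://a.example", "<html><title>Web Mining</title>survey of methods</html>")]
conn.executemany("INSERT INTO layer0_docs VALUES (?, ?)", raw)

# "Generalization" step: derive structured attributes from the raw layer.
for url, html in conn.execute("SELECT url, html FROM layer0_docs"):
    title = html.split("<title>")[1].split("</title>")[0]
    conn.execute("INSERT INTO layer1_meta VALUES (?, ?, ?)",
                 (url, title, len(html.split())))

# Standard SQL now works against the generalized layer.
print(conn.execute("SELECT title, word_count FROM layer1_meta").fetchall())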

Web Query Systems

Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing, to accommodate the types of queries used in World Wide Web searches. We mention a few examples of these Web-based query systems here. W3QL combines structure queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques. WebLog is a logic-based query language for restructuring extracted information from Web information sources. Lorel and UnQL query heterogeneous and semi-structured information on the Web using a labeled graph data model. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate a database representation of the extracted information. To convey the flavor of such queries, a rough illustration follows.
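W3QL, WebLog, and Lorel each have their own syntax; the sketch below is not any of those languages, only a rough illustration in plain SQL (with an invented schema and data) of a query that combines a structure condition over links with a content condition over page text.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT);
    CREATE TABLE links (src TEXT, dst TEXT);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("http://hub.example", "index of data mining resources"),
    ("http://a.example", "paper on web usage mining"),
    ("http://b.example", "holiday photo album"),
])
conn.executemany("INSERT INTO links VALUES (?, ?)", [
    ("http://hub.example", "http://a.example"),
    ("http://hub.example", "http://b.example"),
])

# Content condition (body mentions "mining") joined with a structure
# condition (reachable from the hub page by one link).
query = """
    SELECT p.url FROM pages p
    JOIN links l ON l.dst = p.url
    WHERE l.src = 'http://hub.example' AND p.body LIKE '%mining%'
"""
print(conn.execute(query).fetchall())  # [('http://a.example',)]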
