Data Mining: concept, algorithms, analysis, purpose and application

Information technology delivers practical results, but the tasks of finding, analyzing and using information still lack an effective qualitative tool. Analytical and quantitative tools do work, yet a qualitative revolution in the use of information has not happened.

Long before the advent of computer technology, people needed to process large amounts of information and coped with it as far as their accumulated experience and available technical capabilities allowed.

The development of knowledge and skills has always responded to real needs and matched the tasks of the day. Data mining is a collective name for a set of methods for discovering previously unknown, non-trivial, practically useful and interpretable knowledge needed to make decisions in various fields of human activity.

Man, intelligence, programming

A person always knows what to do in any situation. Ignorance or an unfamiliar situation does not prevent him from making a decision. The objectivity and reasonableness of any human decision can be questioned, but the decision will be made.

Intelligence rests on a hereditary "mechanism" and on acquired, active knowledge. This knowledge is used to solve the problems a person faces.

  1. Intelligence is a unique combination of knowledge and skills: capabilities and the foundation for human life and work.
  2. Intelligence is constantly evolving, and human actions affect other people.

Programming is the first attempt to formalize the presentation of data and the process of creating algorithms.

Artificial intelligence (AI) cost a great deal of time and resources, but the results of the last century's unsuccessful attempts in AI were not forgotten: they are used in various expert (intelligent) systems and were transformed, in particular, into the algorithms (rules) and mathematical (logical) analysis of data that make up Data Mining.

Information and the usual search for a solution

An ordinary library is a repository of knowledge, and the printed word and graphics still have not given way to computer technology. Books on physics, chemistry, theoretical mechanics, design, natural history, philosophy, natural sciences, botany, textbooks, monographs, works of scientists, conference materials, reports on experimental design, etc. are always relevant and reliable.

A library holds a great many diverse sources that differ in form, origin, structure, content, style of presentation, etc.

Library: books, magazines and other print media

Outwardly, everything is visible (readable, accessible) for understanding and use. You can solve any problem, correctly set the task, justify the solution, write an abstract or term paper, select material for a diploma, perform analysis of sources on the topic of a dissertation or a scientific and analytical report.

Any informational task is solvable. With due diligence and skill, an accurate and reliable result will be obtained. In this context, Data Mining is a completely different approach.

In addition to the result, a person receives “active links” to everything he looked at while pursuing the goal. The sources he used to solve the problem can be cited, and no one will dispute that a source exists. This is no guarantee of reliability, but it reliably shows to whom responsibility for reliability has been shifted. From this point of view, Data Mining offers considerable doubt about reliability and no “active links”.

By solving several problems, a person obtains results and expands his intellectual potential with many “active links”. If a new task “activates” an existing link, the person already knows how to solve it, and no new search is needed.

An “active link” is a fixed association: how and what to do in a particular case. The human brain automatically remembers everything that seems potentially interesting, useful, or probably necessary in the future. In many ways, this happens on a subconscious level, but as soon as a task arises that can be associated with an “active link”, it instantly pops up in the mind and a solution will be obtained without additional information retrieval. Data mining is always a repetition of a search algorithm and this algorithm does not change.

Basic Search: “Artistic” Tasks

Finding information in a mathematics library is a relatively easy task. Finding a way to evaluate an integral, construct a matrix, or add two imaginary numbers is laborious, but simple. You need to go through a number of books, many of them written in a specialized language, find the right text, study it and obtain the desired solution.

Over time, the search becomes familiar, and the accumulated experience makes it easier to navigate the library and other mathematical problems. This is a limited information space of questions and answers. A characteristic feature: this kind of search accumulates knowledge for solving similar problems. A person's search for information leaves traces ("active links") in his memory that may help solve other problems.

In fiction, finding the answer to the question "How did people live in January 1248?" is very hard. It is even harder to find out what lay on store shelves and how the food trade was organized. Even if some writer described this clearly and directly in a novel, and even if that writer's name is found, doubts about the reliability of the data will remain. Reliability is a critical characteristic of any amount of information. The source, the author and the evidence that rules out a false result are all important.

Objective circumstances of a particular situation

A person sees, hears, feels. Some experts have mastered a unique sense: intuition. Stating a problem requires information, and the process of solving it is usually accompanied by refinement of that statement. This is a lesser trouble than the one that begins the moment information is moved into the depths of a computer system.

The library and work colleagues are indirect participants in the decision process. The design of the book (source), the graphics in the text, the features of dividing information into headings, footnotes for phrases, an index, a list of primary sources - all evoke associations in a person that indirectly affect the process of solving the problem.

The time and place of solving the problem are essential. A person is made in such a way that he involuntarily pays attention to everything that surrounds him while solving a problem. This can be distracting, but it can also be stimulating. Data Mining will never "understand" this.

Information in the virtual space

A person has always been interested in only reliable information about an event, phenomenon, subject, algorithm for solving a problem. Man has always imagined how exactly he can achieve the desired goal.

The advent of computers and information systems was supposed to simplify a person’s life, but it only made things more complicated. Information migrated into the depths of computer systems and disappeared from sight. To select the data you need, you must write a correct algorithm or formulate a database query.

Data inside the information system

The question must be correct. Only then can an answer be obtained, though doubts about its validity will remain. In this sense, Data Mining really is “excavation”, or “information mining”, as the phrase is fashionably translated. In Russian the term is usually rendered as “intelligent data analysis” or “data analysis technology”.

In the works of authoritative specialists, the tasks of Data Mining are indicated as follows:

  • classification;
  • clustering;
  • association;
  • sequence;
  • forecasting.

From the point of view of the practice that guides a person in the manual processing of information, all these items are debatable. In any case, a person processes information automatically and does not think about classifying data, compiling thematic groups of objects (clustering), searching for temporal patterns (sequence), or forecasting the result.

In the human mind, all these items are represented by active knowledge, which covers more ground and applies the logic of processing the source data dynamically. A person's subconscious plays an important role, especially when he is an expert in a particular branch of knowledge.

Example: wholesale of computer equipment

The task is simple. There are dozens of suppliers of computer hardware and peripherals. Each has a price list in xls format (an Excel file), which can be downloaded from the supplier's official website. The goal is to create a web resource that reads the Excel files, converts them into database tables and lets customers select the desired products at the lowest prices.

Problems arise immediately. Each supplier has its own version of the structure and content of the xls file. The file can be obtained by downloading it from the supplier’s website, by ordering it by e-mail, or by downloading it via a link in a personal account, that is, after official registration with the supplier.

Virtual computer store

The solution to the problem is, at the very beginning, technologically simple. The files (source data) are downloaded, a file recognition algorithm is written for each supplier, and the data is placed in one large table of source data. Once all the data has been received, a mechanism is set up for continuously loading fresh data (daily, weekly or upon change):

  • assortment change;
  • price change;
  • clarification of the quantity in stock;
  • adjustment of warranty periods, characteristics, etc.
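
The per-supplier recognition step described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: it assumes each Excel file has already been read into a list of row dictionaries, and the supplier labels, column names (`Name`, `Price`, `Stock`, `Product`, `Cost`, `Qty`) and the common schema are all hypothetical.

```python
from typing import Callable, Dict, List

# Common schema of the shared table (field names are assumptions).
Offer = Dict[str, object]

def parse_supplier_a(row: Dict[str, str]) -> Offer:
    # Hypothetical supplier "A": columns "Name", "Price", "Stock".
    return {
        "supplier": "A",
        "name": row["Name"].strip(),
        "price": float(row["Price"]),
        "in_stock": int(row["Stock"]),
    }

def parse_supplier_b(row: Dict[str, str]) -> Offer:
    # Hypothetical supplier "B": other column names, comma as decimal separator.
    return {
        "supplier": "B",
        "name": row["Product"].strip(),
        "price": float(row["Cost"].replace(",", ".")),
        "in_stock": int(row["Qty"]),
    }

# One recognition function per supplier.
PARSERS: Dict[str, Callable[[Dict[str, str]], Offer]] = {
    "A": parse_supplier_a,
    "B": parse_supplier_b,
}

def load_offers(files: Dict[str, List[Dict[str, str]]]) -> List[Offer]:
    """Merge rows from every supplier into one list with a common schema."""
    offers: List[Offer] = []
    for supplier, rows in files.items():
        parse = PARSERS[supplier]
        offers.extend(parse(row) for row in rows)
    return offers

raw_files = {
    "A": [{"Name": " notebook Acer ", "Price": "500.0", "Stock": "3"}],
    "B": [{"Product": "Dell laptop", "Cost": "750,5", "Qty": "1"}],
}
common_table = load_offers(raw_files)
```

Adding a new supplier then means writing one more parser and registering it in `PARSERS`; the shared table itself does not change.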

This is where the real problems begin. The thing is that the supplier can write:

  • notebook Acer;
  • notebook Asus;
  • Dell laptop.

We are talking about the same product, but from different manufacturers. How can the algorithm establish that notebook = laptop, and how can it strip Acer, Asus and Dell from the product name?

For a person, this is not a problem, but how will the algorithm “understand” that Acer, Asus, Dell, Samsung, LG, HP, Sony are trademarks or suppliers? How can it match a localized “printer” with the English printer, “scanner” with “MFP”, “Xerox” with “MFP”, “headphones” with “headset”, or “accessories” with its synonyms?

Building a tree of categories from the source data (source files) is already a problem when everything has to be automated.
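
One simple way to approach the naming problem is a hand-maintained dictionary of brands and category synonyms. The sketch below is only an illustration under that assumption: the brand list is taken from the text, while the synonym table and the `normalize` function are hypothetical and would need constant manual upkeep in practice.

```python
# Brand names mentioned in the text, lower-cased for matching.
BRANDS = {"acer", "asus", "dell", "samsung", "lg", "hp", "sony"}

# Hypothetical synonym table mapping supplier wording to one category name.
CATEGORY_SYNONYMS = {
    "notebook": "laptop",
    "laptop": "laptop",
    "printer": "printer",
}

def normalize(name: str) -> dict:
    """Split a raw product name into brand, category and remaining words."""
    brand = None
    category = None
    rest = []
    for token in name.lower().split():
        if token in BRANDS:
            brand = token
        elif token in CATEGORY_SYNONYMS:
            category = CATEGORY_SYNONYMS[token]
        else:
            rest.append(token)
    return {"brand": brand, "category": category, "other": " ".join(rest)}

acer = normalize("notebook Acer")
dell = normalize("Dell laptop")
```

Here `acer` and `dell` end up in the same category with different brands. Anything the dictionaries do not cover (for example “scanner” versus “MFP”) stays unresolved, which is exactly where a person is still needed.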

Data Sample: Excavations of the “Freshly Filled”

The task of creating a database of suppliers of computer equipment has been solved. A category tree has been built, a common table is functioning with offers from all suppliers.

Typical Data Mining tasks in the context of this example:

  • find a product at the lowest price;
  • choose a product with a minimum shipping cost and price;
  • analysis of goods: characteristics and prices by criteria.

In the real work of a manager using data from several dozen suppliers, there will be many variations of these tasks, and there will be even more real situations.
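
The first two tasks from the list above can be sketched over the common table. The records, field names and prices below are invented for illustration only.

```python
# Hypothetical rows of the common offers table.
offers = [
    {"supplier": "A", "product": "ASUS VivoBook S15", "price": 600.0, "shipping": 25.0},
    {"supplier": "B", "product": "ASUS VivoBook S15", "price": 900.0, "shipping": 0.0},
    {"supplier": "C", "product": "ASUS VivoBook S15", "price": 610.0, "shipping": 5.0},
]

def cheapest(offers, product):
    """Return the offer with the lowest price for the product, or None."""
    matching = [o for o in offers if o["product"] == product]
    return min(matching, key=lambda o: o["price"]) if matching else None

def cheapest_delivered(offers, product):
    """Return the offer with the lowest price plus shipping, or None."""
    matching = [o for o in offers if o["product"] == product]
    return min(matching, key=lambda o: o["price"] + o["shipping"]) if matching else None

by_price = cheapest(offers, "ASUS VivoBook S15")
by_total = cheapest_delivered(offers, "ASUS VivoBook S15")
```

With these made-up numbers the two criteria pick different suppliers, which shows how quickly even a simple selection stops having a single obvious answer.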

For example, supplier “A” sells the ASUS VivoBook S15 on prepayment, with delivery 5 days after the money is actually received. Supplier “B” offers the same model with payment upon receipt and delivery within a day of signing the contract, but at a price one and a half times higher.

This is where Data Mining, the “digging”, begins. The figurative expressions “excavation” and “data mining” are synonyms. The point is to obtain a basis for a decision.

For suppliers “A” and “B” there is a supply history. Prepayment in the first case must be weighed against payment upon receipt in the second, taking into account that the supply failure rate in the second case is 65% higher. The risk of penalties from the client may be higher or lower. What should be determined, how, and what decision should be made?

On the other hand, the database was created by a programmer and a manager. If the programmer and the manager have been replaced, how can the current state of the database be determined, and how can one learn to use it correctly? Data mining will be needed here too. Data Mining offers many mathematical and logical methods that do not care what kind of data is being studied. In some cases this yields the right decision, but not in all.
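
One possible way to turn such a comparison into numbers is an expected-cost estimate. The sketch below is a deliberate simplification: the base price, the baseline failure rate and the penalty are invented, only the ratios (a price 1.5 times higher, a failure rate 65% higher) come from the example, and payment terms and delivery time are ignored.

```python
def expected_cost(price: float, failure_rate: float, penalty: float) -> float:
    """Price plus the penalty weighted by the probability of a failed delivery."""
    return price + failure_rate * penalty

# All absolute numbers are hypothetical; only the ratios come from the example.
PRICE_A = 600.0
PRICE_B = PRICE_A * 1.5        # "B" is one and a half times more expensive
FAILURE_A = 0.10               # assumed baseline failure rate for "A"
FAILURE_B = FAILURE_A * 1.65   # failure rate for "B" is 65% higher
PENALTY = 1000.0               # assumed client penalty for a failed delivery

cost_a = expected_cost(PRICE_A, FAILURE_A, PENALTY)
cost_b = expected_cost(PRICE_B, FAILURE_B, PENALTY)
better = "A" if cost_a < cost_b else "B"
```

A different penalty or baseline rate can change the outcome, so the formula only supports the decision; it does not make it.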

Moving into virtuality and making sense

Data Mining methods make sense as soon as the information is recorded in the database and has disappeared from the "field of view". Trading computer equipment is an interesting task, but it is just a business. The company's success depends on how well that business is organized.

Climate change on the planet and the weather in a particular city are of interest to all, not just climate professionals. Thousands of sensors take readings of wind, humidity, pressure, data are received from artificial satellites of the Earth, and there is a history of data for years and centuries.

Weather data does more than answer the question of whether to take an umbrella to work. Data Mining technologies underpin the safe flight of an airliner, the stable operation of a highway and the reliable supply of oil products by sea.

The raw data goes into the information system. The tasks of Data Mining are to turn it into a structured system of tables, establish relationships, identify groups of homogeneous data, and discover patterns.
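
As an illustration of identifying groups of homogeneous data, here is a minimal one-dimensional k-means sketch. The choice of k-means, the temperature readings and the starting centers are all assumptions made for the example; the text does not name a specific method.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group mean."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Index of the center closest to this value.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group happens to be empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical temperature readings from two seasons.
temps = [-5.0, -3.0, -4.0, 18.0, 20.0, 22.0]
centers, groups = kmeans_1d(temps, centers=[0.0, 10.0])
```

The readings fall into a "cold" and a "warm" group without anyone labelling them beforehand, which is the sense in which the method does not care what kind of data it studies.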

Climate, weather and raw data

Mathematical and logical methods have proved their practicality since the days of quantitative analytics with OLAP (Online Analytical Processing). Here, technology makes it possible to find meaning rather than lose it, as happened in the example of selling computer equipment.

Moreover, in global tasks:

  • transnational business;
  • air traffic management;
  • study of the bowels of the earth or social problems (at the state level);
  • study of the effect of drugs on a living organism;
  • forecasting the consequences of the construction of industrial enterprises, etc.

Data Mining technologies, which translate "meaningless" data into real data that supports objective decisions, are the only possible option.

Human capabilities end where there is a large amount of raw information. Data Mining systems lose their usefulness where you need to see, understand and feel the information.

Reasonable distribution of functions and objectivity

Man and computer should complement each other: this is an axiom. Writing a dissertation is primarily a task for a person, with the information system as an aid. Here, what Data Mining technology contributes is heuristics, rules and algorithms.

Preparing a weather forecast for the week is primarily a task for the information system. A person manages the data, but bases his decisions on the results of the system's calculations. This work combines Data Mining methods, classification of data by specialists, manual control over the application of algorithms, automatic comparison with data from past years, mathematical forecasting, and a great deal of knowledge and skill on the part of the real people who use the information system.

Man and computer

Probability theory and mathematical statistics are not the most popular or easily understood fields of knowledge. Many experts are far removed from them, yet the techniques developed in these fields give an almost 100% correct result. With systems based on the ideas, methods and algorithms of Data Mining, solutions can be obtained objectively and reliably. Without them, a decision is simply impossible.

Pharaohs and mysteries of past centuries

History was periodically rewritten:

  • states - for the sake of their strategic interests;
  • authoritative scientists - for the sake of their subjective beliefs.

It is difficult to say what is true and what is false. Applying Data Mining can help solve this problem. For example, the technology of building the pyramids was described by chroniclers and studied by scientists in different centuries. Not all materials have reached the Internet, not everything there is unique, and many records may lack:

  • the point in time being described;
  • the time when the description was written;
  • the dates on which the description is based;
  • the author(s) and the opinions (references) taken into account;
  • evidence of objectivity.

In libraries, temples and "unexpected places" you can find manuscripts from different centuries and material evidence of the past.

An interesting goal: to put everything together and unearth the "truth." A feature of the task: the information spans the period from the first chroniclers' descriptions, written while the pharaohs were still alive, to the current century, in which many scientists study this problem with modern methods.

The rationale for using Data Mining: manual work is not feasible. There are too many:

  • sources of information;
  • presentation languages;
  • researchers describing the same thing in different ways;
  • dates, events and terms;
  • term correlation problems.

At the end of the last century, when the latest fiasco of the idea of artificial intelligence became obvious not only to the average person but also to the sophisticated specialist, a new idea emerged: to “recreate a personality”.

For example, a system of rules and a logic of behavior is derived from the works of Pushkin, Gogol or Chekhov, and an information system is created that can answer certain questions as that person would have: Pushkin, Gogol or Chekhov. In theory such a task is interesting, but in practice it is extremely difficult to implement.

However, the idea of such a task leads to a very practical thought: "how to create an intelligent search for information." The Internet comprises a multitude of evolving resources and a huge database, and this is a great reason to use Data Mining in combination with human logic in a joint development format.

A machine and a man in a pair

A machine and a person working as a pair make for a wonderful task and an undoubted success in the field of “information archeology”: high-quality excavations in the data, with results that may cast doubt on some things but will undoubtedly yield new knowledge and be in demand in society.
