What is parsing: purpose and logic

Parsing has become especially popular recently, but the idea itself appeared long ago and has been in use ever since. Processing large amounts of data where the source is not formalized but the algorithm is strictly fixed is a relevant and common task.


What is parsing? The concept is usually associated with the Internet, but the automation of information processing has its roots in local programming. Distributed information processing would not be so effective had it not been preceded by a long period of theory and practice in text analysis.

Understanding Parsing

A parsing program can be written in any programming language. The data source may be:

  • the Internet;
  • a specific list of web resources;
  • a gateway to a local network;
  • a database;
  • scanned material, and more.

Server-side programming in PHP is one of the good tools for solving parsing tasks. XML, CSS, HTML and other similar data presentation formats are the most common sources.


The result of parsing may be, for example:

  • dynamics of the foreign exchange market;
  • stock quotes;
  • climatic data;
  • software updates;
  • news and events in the world and more.

The field of application defines the concept and fills it with concrete meaning, which is what allows you to understand what parsing is.
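To make the idea concrete, here is a minimal sketch of parsing a small exchange-rate feed. It uses Python's standard xml.etree.ElementTree module rather than PHP, and the feed itself (the rates and rate elements, the code and value attributes) is invented for illustration; a real source would have its own structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: element and attribute names are assumptions for this sketch.
SAMPLE_FEED = """\
<rates date="2024-01-15">
  <rate code="USD" value="1.0950"/>
  <rate code="GBP" value="0.8590"/>
</rates>
"""


def parse_rates(xml_text):
    """Return (date, {currency code: rate}) extracted from the feed text."""
    root = ET.fromstring(xml_text)
    rates = {
        node.get("code"): float(node.get("value"))
        for node in root.iter("rate")
    }
    return root.get("date"), rates


feed_date, rates = parse_rates(SAMPLE_FEED)
print(feed_date, rates)
```

The source is formalized here, so the algorithm can stay fixed: the same function works for every new download as long as the feed keeps this structure.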

Influence of the task area on the parsing algorithm

Information systems in exchange trading work very differently from a warehouse accounting system. In the first case, there is a strictly defined, rarely changing set of resources and a fixed algorithm for obtaining the necessary data. In the second case, pattern recognition and the conversion of graphic information into text are required.

Obviously, parsing means something quite different in these two cases. It differs:

  • in how the source data is understood;
  • in the algorithm used to process it.

The collection of climate information cannot rely on a strictly defined range of sources. In this subject area, not only does the number of ways to obtain the source data change, but the parsing logic itself is likely to change as well.

Many financial sites and geographical resources (climate, weather, forecasts) offer visitors not their own pages but the ability to download a regularly updated data file. The task then is to parse the file. In this case it is often not enough to take only the new lines that were absent from previous downloads.

Often the newly downloaded file contains changes throughout its contents. When writing effective parsing programs, this possibility should not be ruled out even when the subject area seems static.
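One way to handle this can be sketched as follows: compare the previous and the current download row by row, keyed on a stable field, and report added, changed and removed rows rather than only appended ones. The CSV layout, the date key and the sample values are assumptions made for the example.

```python
import csv
import io


def index_rows(csv_text, key_field):
    """Map each row's key to the full row (as a dict)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key_field]: row for row in reader}


def diff_downloads(old_text, new_text, key_field):
    """Return keys of rows added, changed, or removed between two downloads."""
    old = index_rows(old_text, key_field)
    new = index_rows(new_text, key_field)
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, changed, removed


# Hypothetical weather files: the first row was revised, the last one is new.
OLD = "date,temp\n2024-01-01,3.5\n2024-01-02,4.0\n"
NEW = "date,temp\n2024-01-01,3.7\n2024-01-02,4.0\n2024-01-03,2.1\n"

added, changed, removed = diff_downloads(OLD, NEW, "date")
print(added, changed, removed)
```

A program that only looked for new lines would pick up 2024-01-03 and silently miss the revised value for 2024-01-01.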


Parsing Logic Analysis

In most cases, what parsing means is determined by the programmer, although the customer may also influence it. Often the developer's ideas and algorithms, especially at the company level, are serious know-how and a trade secret of their author.

Consider the search engines, which once parsed the expanses of the Internet to collect information and now constantly refine what they've collected in order to keep their information store current. Watching them work, you see that there is always a correspondence between:

  • source (key request);
  • search results (response to a request).

This is the classic formula of parsing, the foundation on which each unique implementation is built. It is difficult to work out the parsing algorithm itself, but by analyzing a set of keywords and comparing the corresponding search results, you can determine which tools are appropriate.

The main criterion for any information process is that the solution obtained matches the task. A good addition to the solution is its relevance. Not every web resource states on its pages when the information was last updated, but by comparing the previous results of parsing with the current ones, we can draw conclusions about how often the resource is updated.
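The comparison of previous and current results can be sketched with a simple fingerprint: hash what each parsing run extracted and count how often the hash changes between consecutive runs. The function names and the sample headlines are hypothetical; a real program would also store the time of each run.

```python
import hashlib


def fingerprint(parsed_items):
    """Hash the parsed items so two parsing runs can be compared cheaply."""
    digest = hashlib.sha256()
    for item in parsed_items:
        digest.update(item.encode("utf-8"))
        digest.update(b"\x00")  # separator, so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def count_updates(runs):
    """Count how many consecutive runs produced a different result."""
    prints = [fingerprint(items) for items in runs]
    return sum(1 for prev, cur in zip(prints, prints[1:]) if prev != cur)


# Hypothetical results of four daily parsing runs over the same page.
runs = [
    ["headline A", "headline B"],
    ["headline A", "headline B"],
    ["headline C", "headline A"],
    ["headline C", "headline A"],
]

print(count_updates(runs), "change(s) across", len(runs) - 1, "intervals")
```

With daily runs, one change over three intervals suggests the page is refreshed roughly every few days, which in turn tells you how often it is worth parsing again.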


Parsing Boundary Dynamics

What parsing is becomes quite clear when there is a goal of collecting specific information. There are criteria, a range of data sources and a goal. There may also be further refinements to the conditions of the problem and to the idea of the desired solution.

If you apply PHP to XML, CSS or HTML sources, there is no fundamental problem. These data description languages are strictly formal and, with correct use of regular expressions, yield a reliable result.

If the creator of the resource being parsed changes the structure of the page, or adds descriptions or new tags, then the desired information no longer matches the regular expression that was written, and the resulting selection becomes inaccurate.
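A small sketch of this failure, using Python's re module: the pattern is written against the original markup and stops matching once the markup changes. Both page fragments and the price class name are invented for illustration.

```python
import re

# Pattern written against the original markup (hypothetical page fragment).
PRICE_RE = re.compile(r'<span class="price">([\d.]+)</span>')

old_page = '<div><span class="price">101.25</span></div>'
# Later the site owner adds an attribute and a nested tag.
new_page = '<div><span class="price" data-cur="USD"><b>101.25</b></span></div>'


def extract_price(html):
    """Return the price as a float, or None when the pattern no longer matches."""
    match = PRICE_RE.search(html)
    return float(match.group(1)) if match else None


print(extract_price(old_page))
print(extract_price(new_page))
```

Returning None instead of raising or guessing lets the calling code notice that the page structure has changed and that the pattern needs to be revised.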

You can widen the boundaries of parsing to capture more information and then refine what was received, or narrow the boundaries of the search and get a minimum of information. In the first case, you pay the additional cost of filtering the sample; in the second, it is easy to miss something important.
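The trade-off can be illustrated with two regular expressions over the same fragment: a narrow one that matches only one exact form of markup, and a wide one that takes every table cell and relies on a filtering step afterwards. The markup and the two-decimal price rule are assumptions chosen for the example.

```python
import re

# Hypothetical table row with several kinds of cells.
page = (
    '<td class="val">101.25</td>'
    '<td class="val hi">103.10</td>'
    '<td>2024</td>'
    '<td class="note">n/a</td>'
)

# Narrow boundary: exact class only. Precise, but misses the variant markup.
NARROW_RE = re.compile(r'<td class="val">([^<]*)</td>')
# Wide boundary: any table cell. Catches everything, so it needs filtering.
WIDE_RE = re.compile(r'<td[^>]*>([^<]*)</td>')


def looks_like_price(text):
    """Filter step for the wide sample: keep values with two decimal places."""
    return re.fullmatch(r'\d+\.\d{2}', text) is not None


narrow = NARROW_RE.findall(page)
wide = WIDE_RE.findall(page)
refined = [value for value in wide if looks_like_price(value)]

print(narrow)
print(wide)
print(refined)
```

The narrow pattern loses the second price because its class attribute differs, while the wide pattern keeps it at the cost of an extra pass that discards the year and the note.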

The best solution is to formalize the target information not only in terms of its expected content and its tag environment, but also in terms of the context of the former and the dynamics of the latter. By accumulating experience of the tag environment around the required content, it becomes possible to determine the boundaries of the target's position with a fairly high degree of certainty, without collecting a large excess sample and without losing anything significant.

