Distributed systems: definition, features and basic principles

A distributed system, in its simplest definition, is a group of computers working together that appear as a single computer to the end user. The machines share state, operate concurrently, and can fail independently without affecting the uptime of the entire system. The truth is that managing such systems is a complex topic full of pitfalls.

System Overview


A distributed system allows resources connected to the network (including software) to be shared and used at the same time.

System distribution examples:

  1. Traditional stack. The database is stored in the file system of a single machine; whenever users want information, they talk directly to that machine. To distribute this database system, it has to run on several machines at the same time.
  2. Distributed architecture. The same database runs across multiple nodes, and users can be served by any of them.

A distributed system lets you scale both horizontally and vertically. On a single machine, the only way to handle more traffic is to upgrade the hardware the database runs on. This is called vertical scaling. Vertical scaling works up to a certain limit, after which even the best hardware cannot cope with the required traffic.

Scaling horizontally means adding more computers rather than upgrading the hardware of a single one. Vertical scaling can only push performance up to the capabilities of the latest hardware, which is insufficient for technology companies with moderate to heavy workloads. The best thing about horizontal scaling is that there is no ceiling on size: when performance degrades, you simply add another machine, which in principle can be done indefinitely.

At the corporate level, distributed computing often means placing the various steps of a business process at the most efficient locations in the enterprise network. For example, in a typical three-tier model of distributed data processing, the user interface runs on the user's PC, business logic runs on a remote server, and database access and data processing run on an entirely different computer that provides centralized access for many business processes. Typically, this kind of distributed computing follows the client-server interaction model.
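A toy sketch of that three-tier split, with each tier collapsed into a Python function; in reality each tier would run on its own machine, and every name here is illustrative:

    # Tier 3: data layer, providing centralized access to the records.
    DATABASE = {"order-42": {"item": "book", "qty": 2, "price": 10.0}}

    def data_tier(order_id):
        return DATABASE[order_id]

    # Tier 2: business logic, normally on a remote server.
    def business_tier(order_id):
        order = data_tier(order_id)
        return {"order": order_id, "total": order["qty"] * order["price"]}

    # Tier 1: presentation, running on the user's PC.
    def presentation_tier(order_id):
        result = business_tier(order_id)
        print(f"Order {result['order']}: total ${result['total']:.2f}")

    presentation_tier("order-42")  # Order order-42: total $20.00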

Main goals


The main goals of a distributed system include:

  1. Transparency - presenting a single system image to users while hiding the details of resource location, access, migration, concurrency, failure, relocation, and persistence.
  2. Openness - making the network easy to configure and change.
  3. Reliability - compared with a single system, a distributed system should be more reliable, more consistent, and better at masking failures.
  4. Performance - compared with other models, distributed systems provide a performance boost.
  5. Scalability - a distributed system must be able to scale with respect to geography, administration, or size.

The challenges of distributed systems include:

  1. Security - a major challenge in a distributed environment, especially when public networks are used.
  2. Fault tolerance - hard to achieve when the system is built from unreliable components.
  3. Coordination and resource sharing - difficult without proper protocols and policies.

Distributed computing environment

The Distributed Computing Environment (DCE) is a widely used industry standard that supports this kind of distributed computing. On the Internet, third-party providers offer some generic services that fit into this model.

Grid computing is a computing model with a distributed architecture, in which a large number of computers are connected to solve a complex problem. In the grid model, servers or personal computers perform independent tasks and are loosely linked to each other by the Internet or low-speed networks.

The largest grid computing project is SETI@home, in which individual computer owners volunteer some of their spare processing cycles to the Search for Extraterrestrial Intelligence (SETI) project. This computing problem uses thousands of computers to download and analyze radio telescope data.

One of the first applications of grid computing was breaking a cryptographic code by a group now known as distributed.net. This group also describes its model as distributed computing.

Database scaling


Propagating new information from the master to the slave does not happen instantly. In fact, there is a time window in which you can read stale information. If this were not the case, write performance would suffer, since writes would have to wait synchronously for the data to propagate. Distributed systems come with a handful of such trade-offs.

Using the slave-database approach, you can horizontally scale read traffic up to a point. Writes offer many options too; a single master cannot handle all the write traffic, so it has to be split across several servers. One way is a multi-master replication strategy, in which there are several master nodes instead of slaves, all of them supporting both reads and writes.
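A minimal sketch of read/write routing under this replication scheme, assuming one master and asynchronously updated replicas; the hostnames and the routing heuristic are invented for illustration:

    import random

    MASTER = "db-master:5432"
    REPLICAS = ["db-replica-1:5432", "db-replica-2:5432"]

    def route(query: str) -> str:
        # Send writes to the master and spread reads across replicas.
        # A read served by a replica may return slightly stale data,
        # because replication from the master is asynchronous.
        verb = query.lstrip().split()[0].upper()
        is_write = verb in {"INSERT", "UPDATE", "DELETE"}
        return MASTER if is_write else random.choice(REPLICAS)

    print(route("SELECT * FROM users"))          # one of the replicas
    print(route("INSERT INTO users VALUES (1)")) # always the master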

Another method is called sharding. With it, the data is split across several smaller servers called shards. Shards hold different records, and rules are created that determine which records land on which shard. It is very important to pick the rule so that data is spread evenly; one possible approach is to define ranges based on some piece of information in the record (for example, all users whose names start with A-D).

The shard key must be chosen very carefully, because the load across arbitrary column values is rarely uniform. A single shard that receives more requests than the others is called a hot spot, and shard keys are chosen to prevent one from forming. Once data is split, re-sharding becomes incredibly expensive and can lead to significant downtime.
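A sketch of hash-based sharding, one common way to spread keys evenly and avoid the hot spots that naive range rules can create; the shard names are invented, and real systems also have to handle re-sharding and replication:

    import hashlib

    SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]

    def shard_for(key: str) -> str:
        # Hash the shard key so records scatter uniformly across shards,
        # regardless of how skewed the raw key values are.
        digest = hashlib.md5(key.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("alice"))  # the same key always maps to the same shard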

Database Consensus Algorithms


Transactions are difficult to implement in distributed systems, since they require every node to agree on the correct action, abort or commit. This property is known as consensus, and it is a fundamental problem in building distributed systems. Reaching the kind of agreement needed for the transaction-commit problem would be simple if the participating processes and the network were completely reliable. However, real systems are subject to a number of possible failures: crashed processes and lost, corrupted, or duplicated messages.

This creates a problem: it is impossible to guarantee that correct consensus is reached within a bounded time on an unreliable network. In practice, though, there are algorithms that reach consensus on an unreliable network quickly. Cassandra, for example, provides lightweight transactions through the Paxos algorithm for distributed consensus.
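As an illustration, this is roughly how such a lightweight transaction looks from Cassandra's Python driver (cassandra-driver); the keyspace, table, and column names are made up for this sketch:

    from cassandra.cluster import Cluster

    # Connect to a local node; "demo" is a hypothetical keyspace.
    session = Cluster(["127.0.0.1"]).connect("demo")

    # The IF NOT EXISTS clause makes this a lightweight transaction:
    # Cassandra runs a round of Paxos among the replicas so that only
    # one concurrent writer can claim the username.
    result = session.execute(
        "INSERT INTO users (username, email) VALUES (%s, %s) IF NOT EXISTS",
        ("alice", "alice@example.com"),
    )
    print(result.was_applied)  # True only if this write won the round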

Distributed computing is the key to the influx of big data processing in recent years. It is the technique of splitting an enormous task, for example aggregating 100 billion records, which no single computer can practically handle on its own, into many smaller tasks, each of which fits on a single machine. You split your huge task into many smaller ones, run them on many machines in parallel, and aggregate the results appropriately, and the original problem is solved.

This approach again lets you scale horizontally: when the task grows, just add more nodes to the computation. For many years such tasks have been handled by the MapReduce programming model, which comes with an implementation for processing and generating large data sets in parallel with a distributed algorithm on a cluster.
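A single-process sketch of the MapReduce model applied to a word count; on a real cluster, the map and reduce calls would run in parallel on many machines:

    from collections import defaultdict
    from itertools import chain

    def map_phase(line):
        # Map: emit (key, value) pairs, here (word, 1) for every word.
        return [(word, 1) for word in line.split()]

    def reduce_phase(word, counts):
        # Reduce: aggregate all the values collected for one key.
        return word, sum(counts)

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for word, count in chain.from_iterable(map_phase(l) for l in lines):
        groups[word].append(count)

    result = dict(reduce_phase(w, c) for w, c in groups.items())
    print(result)  # {'the': 3, 'quick': 1, 'brown': 1, ...}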

MapReduce is somewhat dated nowadays and brings problems with it. Other architectures have emerged to address them, namely the Lambda Architecture for distributed stream processing. Advances in this area have brought new tools: Kafka Streams, Apache Spark, Apache Storm, and Apache Samza.

File storage and replication systems


Distributed file systems can be thought of as distributed data stores. It is the same concept: storing and accessing a large amount of data across a cluster of machines that appear as a single whole. They usually go hand in hand with distributed computing.

For example, Yahoo has been known for running HDFS on more than 42,000 nodes to store 600 petabytes of data since 2011. Wikipedia draws the distinction that distributed file systems allow files to be accessed with the same interfaces and semantics as local files, rather than through a custom API such as the Cassandra Query Language (CQL).

The Hadoop Distributed File System (HDFS) is the file system used for distributed computing in the Hadoop infrastructure. Widely adopted, it is used to store and replicate large files (gigabytes or terabytes in size) across many machines. Its architecture consists mainly of NameNodes and DataNodes.

NameNodes are responsible for storing cluster metadata, such as which node holds which file blocks. They act as coordinators of the network, figuring out where best to store and replicate files and monitoring the system's health. DataNodes simply store files and execute commands such as replicating a file, writing a new one, and so on.

Unsurprisingly, HDFS is best used with Hadoop for computation, as it gives computation jobs awareness of where the data lives. Jobs are then launched on the nodes that store the data. This exploits data locality: it optimizes computation and reduces the amount of traffic over the network.
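A toy sketch of the NameNode idea; the node names, block identifiers, and scheduling heuristic are invented and are not HDFS's real API:

    # Metadata kept by the NameNode: which DataNodes hold each block.
    block_locations = {
        "logs.txt#block-0": ["datanode-1", "datanode-3", "datanode-7"],
        "logs.txt#block-1": ["datanode-2", "datanode-3", "datanode-5"],
    }

    def schedule(block: str, free_nodes: set) -> str:
        # Prefer a free node that already stores a replica (data locality);
        # otherwise fall back to any free node, which must then pull the
        # block over the network.
        local = [n for n in block_locations[block] if n in free_nodes]
        return local[0] if local else next(iter(free_nodes))

    print(schedule("logs.txt#block-0", {"datanode-3", "datanode-9"}))  # datanode-3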

The InterPlanetary File System (IPFS) is an exciting new peer-to-peer protocol and network for a distributed file system. Using blockchain technology, it boasts a completely decentralized architecture with no single owner or point of failure.

IPFS offers a naming system (similar to DNS) called IPNS, which lets users easily retrieve information. It stores files with historical versioning, much as Git does, which makes all previous states of a file accessible. It is still under heavy development (v0.4 at the time of writing), but projects interested in building on it have already appeared (FileCoin).

Messaging system


Messaging systems provide a central place for storing and propagating messages within a larger system. They let you decouple application logic from direct communication with other systems.

Known scale - LinkedIn's Kafka cluster processed 1 trillion messages per day, with peaks of 4.5 million messages per second.

Simply put, a messaging platform works as follows:

  1. A message is emitted by the application that potentially creates it, called a producer, enters the platform, and is read by one or more applications called consumers.
  2. If a particular event needs to be stored in several places, for example a user-creation event going to a database, a storage service, and an email service, a messaging platform is the cleanest way to spread that message (see the sketch after this list).
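A minimal sketch of that flow, assuming a local Kafka broker and the kafka-python client; the topic name and payload are made up for the example:

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: the application that creates the event publishes it once.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("user-created", b'{"id": 1, "email": "alice@example.com"}')
    producer.flush()

    # Consumers: the database writer, storage service, and email service
    # each subscribe in their own consumer group and all receive the message.
    consumer = KafkaConsumer(
        "user-created",
        bootstrap_servers="localhost:9092",
        group_id="email-service",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)
        break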

There are several popular top-notch messaging platforms.

RabbitMQ is a message broker that lets you fine-tune the paths messages take using routing rules and other easily configurable settings. You could call it a "smart" broker, because it carries a lot of logic and closely tracks the messages passing through it. It offers both AP and CP configurations in terms of CAP.
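A sketch of that routing control using the pika client against a local broker; the exchange, queue, and routing-key names are invented for the example:

    import pika

    channel = pika.BlockingConnection(
        pika.ConnectionParameters("localhost")
    ).channel()

    # A topic exchange routes each message by pattern-matching its routing key.
    channel.exchange_declare(exchange="events", exchange_type="topic")
    channel.queue_declare(queue="billing")
    channel.queue_bind(queue="billing", exchange="events", routing_key="user.*")

    # Only messages whose key matches "user.*" reach the billing queue;
    # the broker, not the consumer, holds this routing logic.
    channel.basic_publish(
        exchange="events", routing_key="user.created", body=b'{"id": 1}'
    )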

Kafka is a message broker with somewhat less functionality: it does not keep track of which messages have been read and does not allow complex routing logic. This helps it achieve amazing performance. It represents the biggest promise in this space, with active open-source development and support from the Confluent team. Kafka is especially popular with high-tech companies.

Distributed applications

To recap, a distributed system is a group of computers working together so as to appear as a single computer to the end user. These machines share state, operate concurrently, and can fail independently without affecting the uptime of the whole system.

A database counts as distributed only if its nodes interact with one another to coordinate their actions. An application that executes its internal code on a peer-to-peer network in this way is classified as a distributed application.


Examples of such applications:

  1. A famous example of scale: the BitTorrent swarm of 193,000 nodes for an episode of Game of Thrones.
  2. Distributed ledgers, the technology underlying blockchain systems.

Distributed ledgers can be thought of as an immutable, append-only database that is replicated, synchronized, and shared across all the nodes of a distributed network.

A well-known scale: the Ethereum network handled 4.3 million transactions per day on January 4, 2018. Ledgers use the Event Sourcing pattern, which lets you reconstruct the state of the database at any point in time.

Blockchain is the current underlying technology for distributed ledgers, and in fact marked their beginning. This latest and greatest innovation in the distributed space enabled the creation of the first truly distributed payment protocol, Bitcoin.

A blockchain is a distributed ledger carrying an ordered list of all transactions that have ever occurred on its network. Transactions are grouped and stored in blocks, and the entire blockchain is essentially a linked list of blocks. Blocks are costly to create and are tightly bound to each other through cryptography. Simply put, each block contains a special hash (one starting with X zeros) of the current block's contents (in the form of a Merkle tree) plus the hash of the previous block. Producing this hash requires a lot of CPU power.
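A minimal proof-of-work sketch of that idea; the difficulty, block layout, and hashing are heavily simplified (no Merkle tree) for illustration:

    import hashlib
    import json

    DIFFICULTY = 4  # the hash must start with this many zeros

    def mine(transactions, prev_hash):
        # Search for a nonce that makes the block's hash start with
        # DIFFICULTY zeros; this brute-force search is what costs CPU power.
        nonce = 0
        while True:
            block = {"tx": transactions, "prev": prev_hash, "nonce": nonce}
            payload = json.dumps(block, sort_keys=True).encode()
            digest = hashlib.sha256(payload).hexdigest()
            if digest.startswith("0" * DIFFICULTY):
                return block, digest
            nonce += 1

    genesis, genesis_hash = mine(["alice->bob:5"], "0" * 64)
    block2, _ = mine(["bob->carol:2"], genesis_hash)  # links to the previous block
    print(genesis_hash)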

Distributed Operating System Examples


These systems appear to the user as if they were a single-computer system. The machines share memory and disk, and the user has no trouble navigating the data. When a user stores something on their PC, the file is kept in several places, on the attached computers, so lost data can easily be recovered.

Examples of distributed operating systems:

  1. Windows Server 2003
  2. Windows Server 2008
  3. Windows Server 2012
  4. Ubuntu Linux (Apache server).

If one computer becomes overloaded, that is, if many requests land on an individual PC, load balancing takes place: the requests are spread to a neighboring PC. If the network as a whole comes under more load, it can be expanded by adding more systems to it. Network files and folders are kept synchronized, and naming conventions are used so that no errors occur when data is retrieved.
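A toy round-robin sketch of that load-balancing idea; the machine names are invented, and real balancers also track health and current load:

    import itertools

    SERVERS = ["pc-1", "pc-2", "pc-3"]

    # Cycle through the pool so each new request goes to the next machine,
    # spreading the load instead of letting one PC absorb every request.
    pool = itertools.cycle(SERVERS)

    def handle(request: str) -> str:
        return f"{request} -> {next(pool)}"

    for r in ["req-a", "req-b", "req-c", "req-d"]:
        print(handle(r))  # pc-1, pc-2, pc-3, then back to pc-1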

Caching is also used when manipulating data. All the computers use the same namespace for naming files, while each computer keeps its own local file system. When a file is updated, the update is written on one computer and the change is propagated to every computer, so the file looks the same everywhere.


This has been a brief look at distributed systems and why they are used. A few important things to remember: they are complex, they are chosen for their trade-off of scale against cost, and they are harder to work with. These systems fall into several categories: storage, computing, file and messaging systems, ledgers, and applications. And all of this only scratches the surface of a complex topic.

