This article describes a basic Loom setup and how to scale the system to handle larger data loads.
Release or Environment
Sophie Standalone Versions 3.4.x-3.7.x
IDM - Identity Management
RBAC - Role-based access control
RDBMS - Relational database management system
TSDB - Time-series database
Data node - a node running one or more of the following actors:
- Processing actor
- Analysis actor
- Detection actor
- Correlating actor
Processing data-node - a data node running any of the processing, analysis and/or detection actors, but NOT the correlating actor
Correlating data-node - a data node running only the correlating actor.
The basic setup is appropriate only for evaluation purposes or for small-scale environments.
Using this minimal setup, a single server will run 9 containers as follows:
- Data node
- Web application & API node
- IDM (Identity Management) & RBAC (Role-based access control) node
- TSDB node
- RDBMS node
- Search node
- Search web application node
- Grafana container (self monitoring component)
- Diamond container (optional - metric collection from the host & other containers on the VM)
For anything beyond evaluation or small-scale use, the production setup is recommended.
The production environment will have one or more, load-balanced, processing data nodes, where the data-inputs are shared between these nodes (it is possible to further split the actors).
The data nodes then connect to a single correlating data-node.
An additional passive node can be added for every one of the data nodes.
The data nodes rely on three data stores, each of which can and should be clustered.
The search node is based on Elasticsearch 5. It is recommended to run at least three servers to prevent a split-brain scenario.
The RDBMS node is based on PostgreSQL 9. Even on fairly large deployments, the load on these nodes remains low, as they are only used to store configuration settings and data models. On the other hand, this data is the 'memory' of Loom and is very important to safeguard.
PostgreSQL supports an active-passive (hot standby) mode, which is the recommended setup.
The TSDB is based on Graphite 0.9. To scale Graphite, a cluster can be set up with relays and aggregators.
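The routing a Graphite relay performs can be illustrated with a short sketch. This is a simplified, hypothetical illustration of hash-based routing (carbon-relay's real consistent-hashing ring is more elaborate); the backend names are assumptions, not part of this article.

```python
import hashlib

# Hypothetical carbon-cache backends sitting behind a relay.
BACKENDS = ["graphite-a:2004", "graphite-b:2004", "graphite-c:2004"]

def route_metric(metric_name: str, backends=BACKENDS) -> str:
    """Pick a backend for a metric by hashing its name.

    A stable hash means the same metric always lands on the same
    backend, so each metric's whisper file lives on exactly one node.
    """
    digest = int(hashlib.md5(metric_name.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]
```

Because routing is deterministic, adding relays in front of the aggregators spreads write load without splitting any single metric's history across nodes.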
Processing data-nodes are compute-bound and should therefore be assigned to hosts with multiple, strong CPUs.
A typical data node will have 8 CPU cores and 16GB RAM, and will be able to handle up to 10K EPS (events per second).
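The per-node figure above can be turned into a rough sizing estimate. This sketch assumes the ~10K EPS per 8-core/16GB node figure from this article; the 70% headroom factor is an assumption added here for burst and failover margin, not a documented Loom parameter.

```python
import math

def processing_nodes_needed(target_eps: int, eps_per_node: int = 10_000,
                            headroom: float = 0.7) -> int:
    """Estimate how many processing data-nodes a target event rate needs.

    Keeps each node at `headroom` utilization so a burst (or the loss
    of one node) does not immediately saturate the rest.
    """
    usable_eps = eps_per_node * headroom
    return max(1, math.ceil(target_eps / usable_eps))
```

For example, a 25K EPS target would call for 4 nodes under these assumptions, rather than the 3 a naive division would suggest.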
The storage nodes are primarily disk-IO bound. Do not use remote storage (NAS/SAN or other) for any of the storage nodes. Only use locally-attached storage, where SSDs will yield immediate throughput benefits.
The data shipped can optionally be queued before being ingested by the system.
For example, data could be written to a Kafka queue, where each data-node will be configured to read some of the partitions.
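The partition split described above can be sketched as a pure assignment function. The round-robin scheme below is an illustrative assumption (a real deployment might instead use Kafka consumer groups or an explicit partition map in each node's configuration).

```python
def partitions_for_node(node_index: int, node_count: int,
                        partition_count: int) -> list[int]:
    """Statically split Kafka partitions across processing data-nodes.

    Each node reads only its own share; together the nodes consume
    every partition exactly once, with no overlap.
    """
    return [p for p in range(partition_count)
            if p % node_count == node_index]
```

With 12 partitions and 3 data-nodes, node 0 would read partitions [0, 3, 6, 9], and the three nodes together cover all 12 without duplication.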
To achieve high availability for the search nodes, run at least 3 nodes with a data replication factor of 2 (every document is stored on two nodes). To scale out or to strengthen availability, add more search nodes.
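The three-node minimum follows from Elasticsearch's quorum rule: to avoid split-brain, Elasticsearch 5.x's discovery.zen.minimum_master_nodes setting should be a strict majority of the master-eligible nodes, which only yields a useful quorum from three nodes upward. A small sketch of that calculation:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size that prevents split-brain in Elasticsearch 5.x.

    Per the Zen discovery rule, minimum_master_nodes should be a
    strict majority of master-eligible nodes: (n // 2) + 1.
    """
    return master_eligible // 2 + 1
```

With 3 master-eligible nodes the quorum is 2, so a partitioned single node can never elect itself master; with 2 nodes the quorum would also be 2, meaning the loss of either node halts the cluster, which is why two-node search clusters are not recommended.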
PostgreSQL supports an active-passive mode, in which the slave immediately takes the master's place when the master node fails.
Processing nodes can be set to work in an active-passive mode. The active processing node keeps processing and reporting detections, while the passive node tracks its state, waiting to take over if the active node fails.
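The failover logic on the passive side can be sketched as a heartbeat watcher. This is a hypothetical illustration, not Loom's actual API: it assumes the active node publishes a heartbeat, and the passive node promotes itself once that heartbeat goes stale.

```python
import time

class PassiveWatcher:
    """Sketch of a passive processing node's takeover check.

    Assumes the active node calls on_heartbeat() (e.g. via a shared
    store or network ping) at a known interval; check() promotes this
    node once the heartbeat is older than the timeout.
    """

    def __init__(self, timeout_seconds: float = 30.0):
        self.timeout = timeout_seconds
        self.last_heartbeat = time.monotonic()
        self.active = False

    def on_heartbeat(self) -> None:
        """Record a fresh heartbeat from the active node."""
        self.last_heartbeat = time.monotonic()

    def check(self) -> bool:
        """Return True once this node should take over processing."""
        if time.monotonic() - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active
```

The timeout should be a few multiples of the heartbeat interval, so a single delayed heartbeat does not trigger a spurious takeover while both nodes are healthy.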