### Scalable Cyber-Security for Terabit Cloud Computing

### 2012 IEEE High Performance Extreme Computing

Jordi Ros-Giralt / giralt@reservoir.com



- To be able to do packet analysis at very high-speed rates.
- To be able to intelligently move packets from node to node at very high-speed rates.





| Feature Needed (at very high speed rates): | Application:                     |
|--------------------------------------------|----------------------------------|
| Packet analysis 🔿                          | Cyber security                   |
| Packet forwarding $ ightarrow$             | Cloud computing / load balancing |



 $\rightarrow$  PAAS model — the consumer has control over the deployed cyber-security applications and their configuration and it provides support for any of the four deployment models (private cloud, community cloud, public cloud, hybrid cloud)

# Packet Forwarding and Packet Analysis: Two Faces of the Same Coin (Ex Ante and Ex Post Configurations)

- Optimizing for {Speed + Analysis} != Optimizing for Speed + Optimizing for Analysis
- Optimality is achieved with the joint problem

• aCSC cells -- a network configuration to address the jointly optimized solution:





| Feature:                                                                                                                              | Utility:                                                                                                  |
|---------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| mPIPE: Programmable packet<br>classification and switching engine with<br>up to 80G bps and 120M packets-per-<br>second of throughput | <ul> <li>High-speed packet switching</li> <li>No kernel involvement</li> </ul>                            |
| TED Queuing (Tail Early Dropping):<br>Intelligent congestion-avoidance packet<br>dropping policy.                                     | <ul> <li>Intelligent packet dropping</li> <li>Designed to make the system operate out of cache</li> </ul> |
| LF <sup>-</sup> Data Structure: Lock-free data<br>structure with low probability of false<br>negative to convey feedback from A to F  | <ul> <li>Zero locks everywhere</li> <li>Minimize memory contention</li> </ul>                             |



- Suppose that the system is so stressed that we need to drop a packet. In the context of CSC, which packet should we drop?
- Approach: leverage the flows' heavy tails



 $I(b_n)$ : degree of cyber security information carried by bit n-th

• TED Queuing designed to (1) exploit heavy tails and (2) to find an optimal trade-off in hierarchical memory architectures



• TED queuing principle:



Lemma 1: Let  $c_{in}$  be the amount of cyber data per unit of time received by the node and let  $\pi_c$  be the amount of cyber data per unit of time that it can actually process:

- If  $\pi_c < c_{in}$ , there exists a value  $\lambda_{ted}^h$  such that for any  $\lambda_{ted} > \lambda_{ted}^h$ , the analyzers in the node will be accessing packets from memory (cache miss).
- For any value of  $\pi_c$ , there exists a value  $\lambda_{ted}^l$  such that for any  $\lambda_{ted} < \lambda_{ted}^l$ , the analyzers in the node will be accessing packets from cache (cache hit).

### **TED Queuing**

• TED queuing algorithm:

Constants:  $\Delta_{ted}$ ,  $t_{ted}$ ,  $\lambda_{ted}^i$ .

- Step 1. Start with  $\lambda_{ted} = \lambda_{ted}^{i}$ , for an arbitrary  $\lambda_{ted}^{i}$
- Step 2. If the cell is operating in memory regime, then keep reducing  $\lambda_{ted}$  by a value  $\Delta_{ted} > 0$  until the cell starts to operate in cache regime.
- Step 3. Wait a period of time  $t_{ted}$  and then increase  $\lambda_{ted}$  by  $\Delta_{ted}$ . After that, return to step 2.





### LF - Data Structure

- Data structure designed to satisfy the following specifications:
  - It must be able to keep at least one flag for each element stored.
  - It can be concurrently written and read by multiple writers and multiple readers but it does not require locks to preserve the correctness of its elements
  - It can tolerate a low probability of false negatives but not false positives
- Algorithm: Initial state: T[e] = NULL for all e such that  $0 \le e < N$ ; Writer (analyzer) algorithm: Upon detecting that c needs to be dropped, do:  $T[h(id(c)) \mod ulo N] = h(id(c));$ Reader (forwarder) algorithm: Upon receiving a packet from connection c, do: If  $T[h(id(c)) \mod ulo N] == h(id(c))$ Drop the packet; Otherwise Forward the packet;

### LF - Data Structure

Lemma 1: LF<sup>-</sup> state correctness. An element in a data structure constructed using the LF<sup>-</sup> algorithm is in a positive state with probability  $p_t$ , in a false negative state with probability  $p_{fn}$ , and in a false positive state with probability  $p_{fp}$ , where  $p_{fp} \ll p_{fn} \ll p_t$  and  $p_t \approx 1$ .

Proof. [Working out the math—see HPEC12 paper]



### Tilera Gx



## Tilera Gx (TILExtreme)



### Tilera Inter-Processor Communication



### Tilera Inter-Processor Communication



#### Reservoir Labs

broa

C

LF-

table

mPPIPE

feedback

### Mcore mapping



### Mcore extensions for Bro

mcore-def.bro

type mcore\_node\_type: enum { NULL, FORWARDER, ANALYZER, FORWARDER\_ANALYZER };
type mcore\_node\_id: count;

```
const mcore_do_shunting = F & tredef;
const mcore_ted_flow_threshold = 2000 & tredef; # in number of bytes within a flow
const mcore_ted_queue_threshold = 50 & tredef; # in % of queue utilization from 0 to 99
const mcore_force = F & tredef;
```

```
const mcore_enabled = F & tredef;
const mcore_is_manager = F & tredef;
```

### Mcore extensions for Bro

#### mcore-x-y-z.bro arch/tilera/mcore.h redef mcore\_node\_map = { void mcore\_get\_ring\_writer(void); = FORWARDER, [0] void mcore get ring reader(void); [1] = FORWARDER, int mcore\_conf\_rings(int num\_analyzers, int num\_forwarders); (...) int mcore\_init\_rings(int NumAnalyzers); [x-1] = FORWARDER, int mcore link init(int \*cpu ranks, [x] = ANALYZER. int num\_forwarders, [x+1] = ANALYZER, int num\_rings, (...) char\* if\_input, [x+y-1] unsigned char mac\_output[][ADDR\_LEN], = ANALYZER, [x+y] = FORWARDER ANALYZER, char if\_output[][MAX\_STRLEN], $[x+y+1] = FORWARDER_ANALYZER,$ int num\_output\_if, (...) int num input if); $[x+y+z-1] = FORWARDER_ANALYZER_{i}$ void \* mcore\_alloc\_shmem(unsigned int size); void mcore\_free\_shmem(void \* mem, unsigned int size);

### Performance (Tilepro)



### Performance (Tilepro)



### Performance (Tilepro)



### Performance (Tile-GX)

|              |              | # Analyzers | Selective Capacity (events/sec) |            | Converding throughout             |
|--------------|--------------|-------------|---------------------------------|------------|-----------------------------------|
|              | # Forwarders |             | Per Tile                        | Total      | - Forwarding throughput<br>(Gbps) |
| 1 Forwarder  | 1            | 1           | 98.920041                       | 98.920041  | 9.8                               |
|              | 1            | 2           | 167.699023                      | 335.398046 | 9.8                               |
|              | 1            | 3           | 194.169252                      | 582.507756 | 9.8                               |
|              | 1            | 4           | 152.70955                       | 610.8382   | 9.8                               |
|              | 1            | 5           | 122.405928                      | 612.02964  | 9.8                               |
|              | 1            | 6           | 93.275352                       | 559.652112 | 9.7                               |
|              | 1            | 7           | 66.541857                       | 465.792999 | 9.7                               |
|              | 1            | 8           | 47.263248                       | 378.105984 | 9.7                               |
| 2 Forwarders | 2            | 2           | 164.556612                      | 329.113224 | 9.8                               |
|              | 2            | 4           | 148.848772                      | 595.395088 | 9.8                               |
|              | 2            | 8           | 91.044497                       | 728.355976 | 9.8                               |
| 4 Forwarders | 4            | 4           | 168.595005                      | 674.38002  | 9.8                               |
|              | 4            | 8           | 94.797063                       | 758.376504 | 9.8                               |
| 8 Forwarders | 8            | 8           | 96.312077                       | 770.496616 | 9.8                               |



- Live demo at the SCinet Research Sandbox this coming November  $\rightarrow$  target: 4 x Tile Gx-36 from 10 to 80 Gbps.
- Tilera Gx roadmap toward 100 core processors. Leverage this roadmap and other development in the manycore space toward scaling to terabit computing.
- Looking for testbeds that may be interested in testing Mcore technology.

### References

- S. Campbell, J. Lee, "Intrusion Detection at 100G", The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 14, 2011
- "Department of Energy Builds National 100GigE Research Net," Press Release, July 13, 2011.
- M. Vallentin, R. Sommer, J. Lee, C. Leres, V. Paxson, B. Tierney, "The NIDS Cluster: Scalable, Stateful Network Intrusion Detection on Commodity Hardware," Recent Advances in Intrusion Detection, 2007
- National Institute of Standards and Technologies, "The NIST Definition of Cloud Computing," Special Publication 800-145, September 2011.
- J. Ros-Giralt, P. Szilagyi, R. Lethin, "Scalable Cyber-Security for Terabit Cloud Computing (Extended Version)," Reservoir Labs Technical Report, May 2012.
- V. Paxson, "Empirically derived analytic models of wide-area TCP connections," IEEE/ACM Transactions on Networking, 2(4):316–336, August 1994.
- Jose Gonzalez, Vern Paxson, Nicholas Weaver, "Shunting: A Hardware /Software Architecture for Flexible, High Performance Network Intrusion Prevention," ACM Conference on Computer and Communications Security, November 2007.
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein, "Introduction to Algorithms," The MIT Press; 3rd edition, 2009.
- J. Ros-Giralt, P. Szilagyi, "A Lockless Hash Table With Low False Negatives," mathematical note, Reservoir Labs, April 2012.
- Floyd, Sally; Jacobson, Van. "Random Early Detection (RED) gateways for Congestion Avoidance". IEEE/ACM Transactions on Networking, 1(4): 397–413, August 1993.
- Vern Paxson, "Bro: A System for Detecting Network Intruders in Real-Time," Computer Networks, December 1999.
- M. Berezecki, E. Frachtenberg, M. Paleczny, K. Steele, "Many-core key-value store," igcc, pp.1-8, 2011 International Green Computing Conference and Workshops, 2011.
- "Tilera Unveils the Ultimate Cloud Computing Processor," Tilera Press Release, June 2011.
- Carl Ramey, "TILE-Gx100 ManyCore Processor: Acceleration Interfaces and Architecture," Tilera Corporation, August 2011.

