2008年2月25日 星期一

2/26作業

網際網路服務

  • 寫一份xml檔案,使用DTD的規範。
  • 下載兩隻老虎的XML檔案

資訊與系統安全專題

  • 讀paper吧
  • CHAP作業5

Homework 01 自己翻譯版

簡介:在這一個文章中,我們分析不同形式的統計實驗使用在於不規則的入侵警示的關聯性上。我們顯示Granger Causality Test,一些建議可以應用在擴大不規則入侵範圍、強烈的依賴在一個好的參數選擇,被證明是靈敏度高且難以偵測的。我們建議一個不同的方法,依據一套簡單的統計測試,而且我們證明我們的工作標準,以及對一個簡化的相關任務,而不需要複雜的配置參數。

1.Introduction
在入侵偵測中,最有挑戰性的任務之一就是建立一個事件上一致的觀察,從多種的監聽裝置上將兩種以上的警報混和。這種混和程序的警報可被定義為聚合串流警報的關係。此集合為警報的群組都被結束在相同的時間,而且被建立擁有類似的特點,此集合融合了不同觀點在同樣的事件中。警報關聯必須用於邏輯上的連結警報認知。「關係」並不是必要的是指「統計上的關係」,但是某些方法依據統計上的關係有時候被使用在顯示這些關係。

警報的融合是比較複雜的,考慮到異常檢測系統,因為沒有任何資料的類型或分類的觀察攻擊,可以使用在融合算法。在最近相關性的文獻中,大部分的演算法建議使用這個資訊,因此不適用於純粹基於異常的入侵偵測系統。

在這一個實驗中,我們探究統計因果試驗的使用,被建議用在IDS警報的關聯性上,而且也可以良好的應用在異常的IDS警報。我們把焦點專注在Granger Causality Test (GCT)的使用。而且也顯示他的效能強烈的依賴於選擇一個好的參數,可以證明是靈敏的且困難的估計。我們重新定義因果問題在相似的統計測試的術語上,而且試著去驗證他。

2.Problem statement and state of the art
警報混和程序輸出的需求是密切的,高層次的觀察在網路(通常是大而且複雜的)上發生了什麼事情,在這一次的研究中我們使用他。

圖一、圖示報警熔合術語用在這方面的工作

一個些許修改過的科技建議Alerts streams在不同的IDS系統收集常態化且集合之後,警報關聯性是相當最後的一個步驟。在「1」的階段,「融合」被應用在我們稱為「聚合」的階段。然而我們使用前者表示整個過程。

我們建議一個模糊的依賴時間的聚合技巧顯示他產生好的效果在於錯誤的時期確實的減少。在這裡,我們多點聚焦在有挑戰性的關聯性狀態。有效的且一般相關性的演算法是很困難的去設計,特別假如目標是複雜攻擊腳本的重建。

一個關於警報關聯的技巧基於狀態過渡圖在Fig.3顯示出來。有限狀態自動控制的使用能夠用於複雜的方案描述,但是需要已知的腳本簽章。同時前者也不適合用於完全不規則其無法辨別不同形式的事件之間差別的檢測器。稍微接近一點,以相似的長度和毛病但是不同的行為,被嘗試在先-瞭解條件格式的攻擊,有時沿著時間間隔的標準。他可能是採集方案的規則直接來自資料,也從被監督或是為被監督的形式。這兩種方法使用類似於他們的規則的警報分類。

所有這些技術,將致力於異常檢測系統,因為他們信賴警報代號或工作的分類。此種演算法的最佳例子不需要這樣的功能是基於時間序列分析與建模。例如Fig.8是基於時間序列的結構靠著計算警報進入採樣間隔的數量。開發的趨勢和週期性的可移除的演算法允許過濾可預測的物件,留下真實的警報作為輸出。比起一個關聯性的方法,雖然這是一個誤判和降噪的方法。這關聯性的辦法研究在Fig.9而且基於GCT也不需要先備的知識。而且吸引我們的注意力正如那少許可以留下的針對不規則偵測警報的方案關係在早期的文獻中。我們也將描述且更深入的分析這一個方法在Ch.4

3 Problems in the evaluation of alert correlation systems
警報混合系統的估計值仍然受限於一些建議,是實際的且富理論的挑戰可增長[2]。除此之外,可靠資料的來源缺乏的共同問題為了基準程序衝突嚴厲的也在理想的關聯系統評估,我們需要主題及網路資料組,完全的標示有複雜的攻擊腳本詳細的描述。這些資料應該可以隨意的變動對於科學的社會。這些需要的條件超出現實社會。

這類型唯一的資料組有效利用是單一的藉著DARPA(IDEVAL 資料組)。當然,自從這資料組被建立來估計入侵偵測感測器不是用來估算關聯性的工具,他並不包含感測器警報。這警報必須被在資料上連續變化的感測器所建立。有1999個資料組[10]我們拿來使用在這一個研究有許多已知的缺陷。首先他是顯然且無望的過時。而且,為數的缺陷被偵測出而且在網路追蹤上受到批評[11,12]。最近,我們分析了主機式系統呼叫路徑,而且顯示出[13,14]他充斥著許多的問題。

這些基本的缺陷並不是極端的對於這一個研究充滿危險,自從攻擊效果的散播(從網路到主機)藉著任何IDEVAL的已知缺陷是不受影響的,而且事實上我們觀察到它實實在在的如此表現。問題是事實上入侵腳本太簡單而且直接。此外,許多的攻擊是無法被察覺的在主機及網路的資料(如此,建立完整的關聯消失的點上)。現在,網路及攻擊者較為高技巧性且攻擊腳本比起1999以來是越來越複雜,操作許多層面的網路以及應用攻擊。

這一個研究,我們緊密的採用DEFCON 9 CTF轉出和DARPA Cyber Panel Correlation Technology Validation (CTV) [15]資料組關於警報關聯模型的計算。這些已成型的資料組並不是階層化而且不包含任何背景通信,所以在事實上(就像作者自己認為)並不能夠用於適當的計算,但是可以作定量的分析。相反地,DARPA CTV的成果,在2002被提出建立一個複雜的實驗的網路,沿著背景通信和一串的攻擊腳本。這警報藉著變化的感測器產生,當這些攻擊被收集而且給計算關聯工具。不幸的,資料組是無法利用於更進一步的實驗。

依照前面所有的理由,在我們的試驗中我們將使用IDEVAL資料組經過簡化的。我們試著去把從單一主機的IDS感測器(HIDS)的警報流與對應的從網路式的IDS(NIDS)的警報產生關聯,後者監視著整個網路。直到最後,我們執行兩個不規則的IDS模型(亦描述在13,14,16)在整個IDEVAL資料組。我們執行NIDS模型在tcpdump資料而且收集128個違反主機pascal.eyrie.af.mil [17]的攻擊警報。這個NIDS也被1009個警報與其他有關聯的主機所建立。使用HIDS模型,我們建立了1070個警報從pascal.eyrie.af.mil主機所得到。針對這些警報,NIDS的是能夠探測近66 %的攻擊與小於0.03 %的錯誤結果; 而HIDS更是呈現了檢出率為98 %和1.7 %的錯誤結果。

在隨後,我們使用這種以下的速記符號:Net 是 substream 所有警報藉著NIDS所產生的。 HostP 是所有警報藉著HIDS安裝在pascal.eyrie.af.mil 所產生的 substream ,而 NetP 被視為所有的警報藉著NIDS(與 帕斯卡爾 作為標靶)所產生的,最後NetP = Net \ NetP 表示所有警報藉著NIDS(同所有,但 帕斯卡爾 作為標靶)所產生的。

4 The Granger Causality Test
在[9]Qin and Lee建議一個有趣的演算法關於警報關聯性似乎是適當的也能夠適用於不規則的警報。擁有相同特質的警報被組合為一個時間依序專案的集合,屬於相同的模式。(依據[8]的概念。)隨後,一連串頻繁的次數被建立,使用一個大小固定,可調整的視窗,其結果是時間序列對於每一個警報的集合。這一個樣板稍後被應用在 GCT[18],一個統計上有搜尋能力的假說實驗在於因果關係於兩個時間序列中,當他們是線性的發生,常態的過程。GCT提供一個隨機的測量標準,叫做「Granger Causality Index」(GCI),多少的歷史在一個時間序列(假定的原因)需要說明另外一個的發展(假定的結果或是目標)。GCT是基於兩種模式的估算:其一是一個自動回歸模式(AR),在這一個模式,未來目標的樣板是被模組化為有影響力的靠著之前目標本身的例子。其二是自回歸移動平均線外源性(ARMAX)模式,也是進入計算假定的發生時間序列如同外生的元件。一個統計學上的F-試驗建立在模式之上估算錯誤選擇出最適當的模式,假使ARMAX選擇了最適合的,那麼源頭有效的影響著目標。

在沒有監督的因果關係的認證事件被完成靠著反覆的上述程序針對每一個成雙的時間序列。這一個方法的益處是他不需要先備知識(縱使他是使用強力的攻擊數值,假使可以,在一個可選的次序。)然而,在先前的工作中,我們也顯示了GCT轉弱雖然認知上是有意義的關係在 IDEVAL攻擊之間。

我們測試了GCT的感度針對於兩種參數,一個樣本時間w和一個延遲的時間p(是AR的排序)在我們的樣本中。

圖三:最佳的時間延遲p藉著AIC標準非常多重的時間給予。

經過實驗以後,期望得到的結果是NetP、HostP、和HostP6 NetP(表示「因果」當6是它的否定),在「9」,樣本時間是隨意的設定為W=60,當P的選擇不足為證時。然而,我們的實驗顯示這些參數的選擇可以強烈的影響實驗的結果。在圖2,(1-a/b)我們繪製P值和實驗的GCI在每一個不同的P(W=60秒)。特別的是,虛線符合NetP(K)HostP (K)的實驗,而且實線符合HostP(K)NetP(K),我們重新回想假使P值低於有意義的等級,無效的假設是被拒絕的。要注意到不同的P的選擇如何引導不確定的甚至可能的結果。例如,...(數值)

有一個可能的解釋如下:為了計算正確的誤差,GCT是有意義的只要線性回歸模式都是最佳的。假使我們使用Akaike訊息Criterion去計算最佳時間延遲P在所有不同的資料窗中,我們發現P不停的廣泛的變化,正如Fig.3所展示的。事實顯示沒有一個固定的最佳選擇的P值,同時實驗的結果顯著的依賴他,使我們懷疑GCT是可行的功能一般的警訊集合。W的選擇看起來相同重要甚至是較困難去完成,除了猜測以外。

當然,我們的實驗並不是最後的確證,IDEVAL警報集合或許會比較簡單,不足以顯示因果關係,另一方面,儘管那是不太可能的,說明了,GCT是不太適合不規則的偵測警訊。事實上,在[9],被測試過在於錯誤的警報。不過事實上也有理論上的理由去懷疑GCT的應用可以固定、好的結果。第一,實驗是漸進的w.r.t.p表示結果信賴度隨著P值增加而降低,因為自由程度的損害。第二,奠基在強力線性的假定在自動回歸符合步驟的模式,其也強力的依賴觀察的現象。在同樣的方法,不變的模式假設並不長存。

5 Modeling alerts as stochastic processes
為了代替解讀警報串流因著時間序列(按建議由GCT-based方法) ,我們提出要改變的焦點,用一個隨機模型,其中警報被描述為隨著時間(隨機)活動。 這項建議可以被看作是一個正式的延伸的做法,在[1]已經介紹過,其中相關的警報,如果他們被發射由不同IDS中的一個"微不足道"的時間內,在那裡, "微不足道"的定義是一個簡潔、固定標準。

為了簡潔明了,我們再次說明我們的技術在簡單的單一HIDS和單一NIDS的監測整個網路。這一個概念,雖然,可以很容易普及在顧及兩個以上的警報串流中,藉著一對一對的評估他們。每個警報,我們有三個基本信息:一個時間標記,一個目標的主機(固定的,在HIDS的環境,對於主機本身) ,以及設計感應器(在我們的情況下,一個二進制值)。

我們重用的情況和數據,我們已經呈現在第4段。使用不需解釋的標示,我們也定義了以下隨機變數:TNetp 是網路警報在NetP到達的時間( 同樣的定義用於TnetO , THostP);Netp ( NetO ) ,是在一個具體的基於網絡的警報關於pascal (非pascal ),以及相應的基於主機的警報所造成延遲(造成藉著傳輸,處理和不同偵測間隔)。每一個T()的實際值不過是一套從對應的警報串流所獲得的時間標記。我們有理由假定" NetP和TNetP是隨機獨立的(同時假定NetO和TnetO)。


在一個理想的相關框架與兩個同樣完美有以100%的IDS和0%FDR的IDS中,如果兩個警報串流是有關連性的(例如,他們所代表同樣攻擊事件的獨立偵測,由不同IDSes [1]),他們也是同時接近的。NetP和HostP應明顯舉出一對串流的例子。顯然的,在現實世界中,一些警報,將會漏掉(因為錯誤結果,或者僅僅是因為部分的攻擊是無法檢測的,只有特定類型的探測器),而且相關的警報之間的距離將因此有較高的變異性。為了去計算這些,我們就可以"切斷"那些在其他時間序列過於遠離對應警報的警報,假定他們在我們的情況中是單一的,在原有數據中知道單一攻擊沒有持續超過400秒,在這一點上,我們初步定分界門檻。


根據給定的工作假設和建議的隨機模型,我們可以正式的界定問題為一組兩個統計假設的檢驗:

Let { t i,k } be the observed timestamps of T i 8 i 2 { HostP,NetP,NetO } , 第一次實驗的意義是直截了當的:一隨機且大量的時間中,NetP,發生的一連串警報, HostP,k,被網路警報提前,NetP ,k ,如果沒有統計上的顯著數量的事件發生時,實驗結果是警報串流TNetP 與THostP 不相關,在這種情況下,我們有足夠的統計證據,而拒絕H1和不接受任一。同樣的,拒絕第二次試驗的空假設,表示NetO警報串流(對於所有主機,除了pascal ),是與關於pascal的警報串流有關聯性。

值得注意的是,以上兩個試驗有很強的相互關聯:在一個理想的關聯架構,它不會發生的情況,就是以下兩個 " NetP和HostP是有關聯性的 " 且 " NetO和Hostp是相關的 " :這將意味著該網路活動關於所有主機除了pascal(會引起NetO)必須與pascal(提高HostP)的主機活動一同執行與同一量級的 NetP,這是一個直接的相互矛盾的結論。因此,第二次實驗是作為一種"穩健"的標準。

從我們的警報,我們可以簡單地取得計算出NetP的樣本,為每個在NetP中的值,其值在 HostP是最接近的,但更巨大的(支持上述定義的門檻),我們可以做同樣的NetO,使用警報在NetO和HostP。

下一步涉及到的選擇分佈的隨機變量,我們在上面已定義。典型的分佈用於模擬隨機定時事件的發生決定於指數的組合(概率密度函數( PDF格式)) [20]。尤其是,我們決定,使他們與Gamma PDFs符合,因為我們的實驗結果表明,這種分佈是一個很好的選擇,無論是NetP和NetO。

PDF的估計NetPf P := f " NetP , and " NetO , f O := f " NetO , 是用著名的最相似的(ML)技術[21]作為落實在GNU R軟體包:這些結果被歸納在Fig。 4. f P and f O are approximated by Gamma[3 . 0606 , 0 . 0178] and Gamma[1 . 6301 , 0 . 0105], respectively (standard errors on parameters are 0 . 7080, 0 . 0045 for f P and 0 . 1288, 0 . 009 for f O ).從現在起,某一給予的密度 f 的估計會被標示 ? F 。

圖4顯示統計圖與估計密度(紅色虛線)和分位區( QQ的區) , 是關於F和FP 。We recall that QQ plots are an intuitive graphical “tool” for comparing data distributions by plotting the quantile of the first distribution against the quantile of the other one. 我們還記得QQ的區塊是一個直觀的圖形"工具"的資料進行比較資料,藉著第一分佈的分位點對照其餘的分為點。

考慮到樣本大小的 " ( ) 是靠近40 , QQ的區塊,實證證實了我們的直覺:事實上, F 0 及 ? F P 都能夠說明真實的數據,無可避免的,但微不足道有估計誤差。即使? f p 和 ? f O都是gamma-圖形,但必須注意,它們在其參數顯著不同;這是一個非常重要的結果,因為它允許設立一個適當的標準來決定要或不要 " NetP " NetO 是被同樣的標準所產生。

鑑於上述估計的,更精確和穩健的假說測試可於現在設定,試驗1和2可以映射到兩面的的Kolmogorov- Smirnov tests [20],取得了同樣的結果在設定的條件。

這象徵代表"具有相同的分佈"。使用估計是有利的。我們重新回想,KS-test是一個非參數檢驗比較樣本(或一個PDF )對比於PDF(或樣品),檢查與對方有多少不同。這種測試可以被執行,例如,隨著KS-test( )(一個GNU R的原始程序):P值的結果分別為IDEVAL 1999 are 0 . 83 and 0 . 03, respectively.1999年的0 83和0 。 03 。明顯的,有一個顯著的統計數字顯示同意虛無假設的試驗3。它似乎是ML估計是能夠正確地符合Gamma PDF 在 F P(給予NetP樣本),在其中雙重檢查我們對分佈的直覺。這其中同樣的並不待在F 0 :事實上,他也不能。

從NetO以Gamma PDF正確的估計。低P值為了測試4證實了NetO延遲的分配,比起NetP是完全不同的。因此,我們的標準不但承認嘈雜且延誤的關聯性在警報串流之間,如果他們存在 ,它也有能力檢測如果這種相關性並不成立。

我們也考驗了我們警報產生的的技術藉著NIDS/HIDS執行在集合的1998(限於我們對於一個星期頭四天的分析),以交叉驗證了上述結果。們準備了並處理數據以同樣的程序如我們上文所述針對1999筆數據。從幾乎相同比例的主機/警報網對任何pascal或其他主機,有ML估計已計算出兩個Gamma密度在圖5。f P 和 F O接近Gamma(3 . 5127 , 0 . 1478) and Gamma(1 . 3747 , 0 . 0618),分別為(standard errors on estimated parameters are 1 . 3173, 0 . 0596 for f P and 0 . 1265, 0 . 0068 for f O ))。這些參數與我們所估計針對收集到的1999比數據都非常相似。此外,當P值為0.51和0.09,兩者的KS測試證實同一統計不一致,從我們觀察的1999筆數據。上述計算結果顯示,通過解譯警報串流作為隨機程序,有幾個(隨機)的不同點在網路到主機的延遲之間屬於同一網路-主機的攻擊session,而且網路到主機的延遲屬於不同session。利用這些不同處,我們可以找出串流之間的關係在一個不受監督的方式,而不需要事先任何參數。


6 Conclusions 6結論


在這篇文章中我們分析了使用不同類型的統計測試,針對異常檢測警報的關聯性,有一個問題是目前很少或根本沒有解決方案。其中的一些相關建議,可以將它們應用到異常檢測,是使用一種Granger Causality Test(GCT)。經過討論一種可能的測試方法之後,我們觀察到收集到的數據集

傳統上使用的評估方式有不同的缺點,我們藉著使用一個較簡單的情景對比數據作部份的處理,只調查針對特定主機之主機式警報的串流之間的連結,以及相應的警報串流,從一個網路式的探測器。

我們研究使用一種GCT中的建議,在早期著作,它也顯示,依賴於選擇非顯而易見的配置參數,這大大影響最終的結果。我們還表現出,其中的參數(該命令的模式)是絕對關鍵的,但不能針對給定的系統單獨估計。代替GCT,我們提出了一個簡單的警報產生器的統計模型,並形容警報串流和時間標記作為隨機變量,並顯示統計測試可以用來創造一個合理的標準,區分正相關和不相關的串流。我們證明了我們的標準化作業已完備使用在簡化相關我們用來測試的任務且不需要複雜的設定參數。

這是一項探究的工作,並需要進一步用於實際的研究,更多的序列數據,以及進一步我們所提出的完善的測試和標準,肯定是必要的。另一種這項工作可能的延伸是這些條件可以被使用在建立異常和misuse-based警報的關聯性的研究,以建立現有侵入偵測的範例。

Homework 01 原文

On the Use of Different Statistical Tests for Alert Correlation -- Short Paper
Maggi, Federico / Zanero, Stefano

Abstract. In this paper we analyze the use of different types of statistical tests for the correlation of anomaly detection alerts. We show that the Granger Causality Test, one of the few proposals that can be extended to the anomaly detection domain, strongly depends on good choices of a parameter which proves to be both sensitive and difficult to estimate. We propose a different approach based on a set of simpler statistical tests, and we prove that our criteria work well on a simplified correlation task, without requiring complex configuration parameters.


1 Introduction


One of the most challenging tasks in intrusion detection is to create a unified vision of the events, fusing together alerts from heterogeneous monitoring devices. This alert fusion process can be defined as the correlation of aggregated streams of alerts. Aggregation is the grouping of alerts that both are close in time and have similar features; it fuses together different “views” of the same event. Alert correlation has to do with the recognition of logically linked alerts. “Correlation” does not necessarily imply “statistical correlation”, but statistical correlation based methods are sometimes used to reveal these relationships.


Alert fusion is more complex when taking into account anomaly detection systems, because no information on the type or classification of the observed attack is available to the fusion algorithms. Most of the algorithms proposed in the current literature on correlation make use of such information, and are therefore inapplicable to purely anomaly based intrusion detection systems.


In this work, we explore the use of statistical causality tests, which have been proposed for the correlation of IDS alerts, and which could be applied to anomaly based IDS as well. We focus on the use of Granger Causality Test (GCT), and show that its performance strongly depends on a good choice of a parameter which proves to be sensitive and difficult to estimate. We redefine the causality problem in terms of a simpler statistical test, and experimentally validate it.


2 Problem statement and state of the art


The desired output of an alert fusion process is a compact, high-level view of what is happening on a (usually large and complex) network. In this work we use


Fig. 1. A diagram illustrating alert fusion terminology as used in this work.



a slightly modified version of the terminology proposed in [1]. Alerts streams are collected from different IDS sources, normalized and aggregated; alert correlation is the very final step of the process. In [1] the term “fusion” is used for the phase we name “aggregation”, whereas we use the former to denote the whole process.

Fig. 1 summarizes the terminology.


In [2] we propose a fuzzy time-based aggregation technique, showing that it yields good performance in terms of false positive reduction. Here, we focus on the more challenging correlation phase. Effective and generic correlation algorithms are difficult to design, especially if the objective is the reconstruction of complex attack scenarios.


A technique for alert correlation based on state-transition graphs is shown in [3]. The use of finite state automata enables for complex scenario descriptions, but it requires known scenarios signatures. It is also unsuitable for pure anomaly detectors which cannot differentiate among different types of events. Similar approaches, with similar strengths and shortcomings but different formalisms, have been tried with the specification of pre- and post-conditions of the attacks [4], sometimes along with time-distance criteria [5]. It is possible to mine scenario rules directly from data, either in a supervised [6] or unsupervised [7] fashion. Both approaches use alert classifications as part of their rules.


None of these techniques would work for anomaly detection systems, as they rely on alert names or classification to work. The best examples of algorithms that do not require such features are based on time-series analysis and modeling. For instance, [8] is based on the construction of time-series by counting the number of alerts occurring into sampling intervals; the exploitation of trend and periodicity removal algorithms allows to filter out predictable components, leaving real alerts only as the output. More than a correlation approach, this is a false-positive and noise-suppression approach, though. The correlation approach investigated in [9] and based on the GCT also does not require prior knowledge, and it drew our attention as one of the few viable proposal for anomaly detection alert correlation in earlier literature. We will describe and analyze this approach in detail in Section 4.


3 Problems in the evaluation of alert correlation systems


Evaluation techniques for alert fusion systems are still limited to a few proposals, and practically and theoretically challenging to develop [2]. Additionally, the common problem of the lack of reliable sources of data for benchmarking impacts heavily also on the evaluation of correlation systems. Ideally, we need both host and network datasets, fully labeled, with complex attack scenarios described in detail. These data should be freely available to the scientific community. These requirements rule out real-world dumps.


The only datasets of this kind effectively available are the ones by DARPA (IDEVAL datasets). Of course, since this data set was created to evaluate IDS sensors and not to assess correlation tools, it does not include sensor alerts. The alerts have to be generated by running various sensors on the data. The 1999 dataset [10], which we used for this work, has many known shortcomings. Firstly, it is evidently and hopelessly outdated. Moreover, a number of flaws have been detected and criticized in the network traces [11,12]. More recently, we analyzed the host-based system call traces, and showed [13, 14] that they are ridden with problems as well.


For this work these basic flaws are not extremely dangerous, since the propagation of attack effects (from network to hosts) is not affected by any of the known flaws of IDEVAL, and in fact we observed it to be quite realistically present. What could be a problem is the fact that intrusion scenarios are too simple and extremely straightforward. Additionally, many attacks are not detectable in both network and host data (thus making the whole point of correlation disappear). Nowadays, networks and attackers are more sophisticated and attack scenarios are much more complex than in 1999, operating at various layers of the network and application stack.


The work we analyze closely in the following [9] uses both the DEFCON 9 CTF dumps and the DARPA Cyber Panel Correlation Technology Validation (CTV) [15] datasets for the evaluation of an alert correlation prototype. The former dataset is not labeled and does not contain any background traffic, so in fact (as the authors themselves recognize) it cannot be used for a proper evaluation, but just for qualitative analysis. On the contrary, the DARPA CTV effort, carried out in 2002, created a complex testbed network, along with background traffic and a set of attack scenarios. The alerts produced by various sensors during these attacks were collected and given as an input to the evaluated correlation tools. Unfortunately, this dataset is not available for further experimentation.


For all the previous reasons, in our testing we will use the IDEVAL dataset with the following simplification: we will just try to correlate the stream of alerts coming from a single host-based IDS (HIDS) sensor with the corresponding alerts from a single network-based IDS (NIDS), which is monitoring the whole network. To this end, we ran two anomaly-based IDS prototypes (both described in [13,14,16]) on the whole IDEVAL testing dataset. We ran the NIDS prototype on tcpdump data and collected 128 alerts for attacks against the host pascal.eyrie.af.mil [17]. The NIDS also generated 1009 alerts related to other hosts. Using the HIDS prototype we generated 1070 alerts from the dumps of the host pascal.eyrie.af.mil. With respect to these alerts, the NIDS was capable of detecting almost 66% of the attacks with less than 0.03% of false positives; the HIDS performs even better with a detection rate of 98% and 1.7% of false positives.

Fig. 2. p-value (-a) and GCI (-b) vs. p with w = w1 = 60s (1-) and w = w2 = 1800s

(2-) “NetP(k) HostP (k)” (dashed line), “HostP (k) NetP(k)” (solid line).


In the following, we use this shorthand notation: Net is the substream of all the alerts generated by the NIDS. HostP is the substream of all the alerts generated by the HIDS installed on pascal.eyrie.af.mil, while NetP regards all the alerts (with pascal as a target) generated by the NIDS; finally, NetO = Net\NetP indicates all the alerts (with all but pascal as a target) generated by the NIDS.


4 The Granger Causality Test


In [9] Qin and Lee propose an interesting algorithm for alert correlation which seems suitable also for anomaly-based alerts. Alerts with the same feature set are grouped into collections of time-sorted items belonging to the same “type” (following the concept of type of [8]). Subsequently, frequency time series are built, using a fixed size sliding-window: the result is a time-series for each collection of alerts. The prototype then exploits the GCT [18], a statistical hypothesis test capable of discovering causality relationships between two time series when they are originated by linear, stationary processes. The GCT gives a stochastic measure, called Granger Causality Index (GCI), of how much of the history of one time series (the supposed cause) is needed to “explain” the evolution of the other one (the supposed consequence, or target). The GCT is based on the estimation of two models: the first is an Auto Regressive model (AR), in which future samples of the target are modeled as influenced only by past samples of the target itself; the second is an Auto Regressive Moving Average eXogenous (ARMAX) model, which also takes into account the supposed cause time series as an exogenous component. A statistical F-test built upon the model estimation errors selects the best-fitting model: if the ARMAX fits better, the cause effectively influences the target.


In [9] the unsupervised identification of “causally related” events is performed by repeating the above procedure for each couple of time-series. The advantage of the approach is that it does not require prior knowledge (even if it may use attack probability values, if available, for an optional prioritization phase). However, in a previous work [2] we showed that the GCT fails however in recognizing “meaningful” relationships between IDEVAL attacks.


We tested the sensitivity of the GCT to the choice of two parameters: the sampling time, w, and the time lag p (that is, the order of the AR). In our simple


Fig. 3. The optimal time lag ˆp given by the AIC criterion strongly varies over time.


experiment, the expected result is that NetP HostP , and that HostP 6 NetP (the indicates “causality” while 6 is its negation). In [9] the sampling time was arbitrarily set to w = 60s, while the choice of p is not documented. However, our experiments show that the choice of these parameters can strongly influence the results of the test. In Fig. 2 (1-a/b) we plotted the p-value and the GCI of the test for different values of p (w = 60s). In particular, the dashed line corresponds to the test NetP (k) HostP (k), and the solid line to the test HostP (k) NetP (k).We recall that if the p-value is lower than the significance level, the null hypothesis is refused. Notice how different choices of p can lead to inconclusive or even opposite results. For instance, with = 0.20 and with 2  p  3, the result is that NetP (k) HostP (k) and that HostP (k) 6 NetP (k). As we detailed in [2] (Fig. 2 (2-a/b)), other values of p lead to awkward result that both HostP (k) NetP (k) and NetP (k) HostP (k).


A possible explanation is that the GCT is significant only if both the linear regression models are optimal, in order to calculate the correct residuals. If we use the Akaike Information Criterion (AIC) [19] to estimate the optimal time lag ˆp over different windows of data, we find out that ˆp wildly varies over time, as it is shown in Fig. 3. The fact that there is no stable optimal choice of p, combined with the fact that the test result significantly depends on it, makes us doubt that the Granger causality test is a viable option for general alert correlation. The choice of w seems equally important and even more difficult to perform, except by guessing.


Of course, our testing is not conclusive: the IDEVAL alert sets may simply not be adequate for showing causal relationships. Another, albeit more unlikely, explanation, is that the Granger causality test may not be suitable for anomaly detection alerts: in fact, in [9] it has been tested on misuse detection alerts. But in fact there are also theoretical reasons to doubt that the application of the Granger test can lead to stable, good results. First, the test is asymptotic w.r.t.p meaning that the results reliability decreases as p increases because of the loss of degrees of freedom. Second, it is based on the strong assumption of linearity in the auto-regressive model fitting step, which strongly depends on the observed phenomenon. In the same way, the stationarity assumption of the model does not always hold.


5 Modeling alerts as stochastic processes


Instead of interpreting alert streams as time series (as proposed by the GCTbased approach), we propose to change point of view by using a stochastic model in which alerts are modeled as (random) events in time. This proposal can be seen as a formalized extension of the approach introduced in [1], which correlates alerts if they are fired by different IDS within a “negligible” time frame, where “negligible” is defined with a crisp, fixed threshold.


For simplicity, once again we describe our technique in the simple case of a single HIDS and a single NIDS which monitors the whole network. The concepts, however, can be easily generalized to take into account more than two alert streams, by evaluating them couple by couple. For each alert, we have three essential information: a timestamp, a “target” host (fixed, in the case of the HIDS, to the host itself), and the generating sensor (in our case, a binary value).


We reuse the scenario and data we already presented in Section 4 above. With a self-explaining notation, we also define the following random variables: TNetP are the arrival times of network alerts in NetP (TNetO, THostP are similarly defined); "NetP ("NetO) are the delays (caused by transmission, processing and different granularity in detection) between a specific network-based alert regarding pascal (not pascal) and the corresponding host-based one. The actual values of each T(·) is nothing but the set of timestamps extracted from the corresponding alert stream. We reasonably assume that "NetP and TNetP are stochastically independent (the same is assumed for "NetO and TnetO).


In an ideal correlation framework with two equally perfect IDS with a 100% DR and 0% FPR, if two alert streams are correlated (i.e., they represent independent detections of the same attack occurrences by different IDSes [1]), they also are “close” in time. NetP and HostP should evidently be an example of such a couple of streams. Obviously, in the real world, some alerts will be missing (because of false negatives, or simply because some of the attacks are detectable only by a specific type of detector), and the distances between related alerts will therefore have some higher variability. In order to account for this, we can “cut off” alerts that are too far away from a corresponding alert in the other time series, presuming them to be singletons. In our case, knowing that single attacks did not last more than 400s in the original dataset, we tentatively set a cutoff threshold at this point.


Under the given working assumptions and the proposed stochastic model, we can formalize the correlation problem as a set of two statistical hypothesis tests:

H0 : THostP 6= TNetP + "NetP vs. H1 : THostP = TNetP + "NetP (1)

H0 : THostP 6= TNetO + "NetO vs. H1 : THostP = TNetO + "NetO (2)


Let {ti,k} be the observed timestamps of Ti 8i 2 {HostP,NetP,NetO}, the meaning of the first test is straightforward: within a random amount of time, "NetP , the occurring of a host alert, tHostP,k, is preceded by a network alert, tNetP,k. If this does not happen for a statistically significant amount of events, the test result is that alert stream TNetP is uncorrelated to THostP ; in this case, we have enough statistical evidence for refusing H1 and accepting the null one. Symmetrically, refusing the null hypothesis of the second test means that the NetO alert stream (regarding to all hosts but pascal) is correlated to the alert stream regarding pascal.


Fig. 4. Histograms vs. est. density (red dashes) and Q-Q plots, for both fˆO and fˆP .


Note that, the above two tests are strongly related to each other: in an ideal correlation framework, it cannot happen that both “NetP is correlated to HostP ” and “NetO is correlated to HostP ”: this would imply that the network activity regarding to all hosts but pascal (which raises NetO) has to do with the host activity of pascal (which raises HostP ) with the same order of magnitude of NetP , that is an intuitively contradictory conclusion. Therefore, the second test acts as a sort of “robustness” criterion.

From our alerts, we can compute a sample of "NetP by simply picking, for each value in NetP , the value in HostP which is closest, but greater (applying a threshold as defined above). We can do the same for "NetO, using the alerts in NetO and HostP .


The next step involves the choice of the distributions of the random variables we defined above. Typical distributions used for modeling random occurrences of timed events fall into the family of exponential Probability Density Functions (PDF)s [20]. In particular, we decided to fit them with Gamma PDFs, because our experiments show that such a distribution is a good choice for both the "NetP and "NetO.


The estimation of the PDF of "NetP , fP := f"NetP , and "NetO, fO := f"NetO, is performed using the well known Maximum Likelihood (ML) technique [21] as implemented in the GNU R software package: the results are summarized in Fig. 4. fP and fO are approximated by Gamma[3.0606, 0.0178] and Gamma[1.6301, 0.0105], respectively (standard errors on parameters are 0.7080, 0.0045 for fP and 0.1288, 0.009 for fO). From now on, the estimator of a given density f will be indicated as ˆ f.


Fig. 4 shows histograms vs. estimated density (red, dashed line) and quantilequantile plots (Q-Q plots), for both ˆ fO and ˆ fP . We recall that Q-Q plots are an intuitive graphical “tool” for comparing data distributions by plotting the quantile of the first distribution against the quantile of the other one.


Considering that the samples sizes of "(·) are around 40, Q-Q plots empirically confirms our intuition: in fact, ˆ fO and ˆ fP are both able to explain real data well, within inevitable but negligible estimation errors. Even if ˆ fP and ˆ fO are both Gamma-shaped, it must be noticed that they significantly differ in their parametrization; this is a very important result since it allows to set up a proper criterion to decide whether or not "NetP and "NetO are generated by the same phenomenon.


Fig. 5. Histograms vs. est. density (red dashes) for both fˆO and fˆP (IDEVAL 1998)


Given the above estimators, a more precise and robust hypotheses test can be now designed. The Test 1 and 2 can be mapped into two-sided Kolmogorov-Smirnov (KS) tests [20], achieving the same result in terms of decisions:

H0 : "NetP  fP vs. H1 : "NetP 6 fP (3)

H0 : "NetO  fO vs. H1 : "NetO 6 fO (4)


where the symbol  means “has the same distribution of”. Since we do not know the real PDFs, estimators are used in their stead. We recall that the Kstest is a non-parametric test to compare a sample (or a PDF) against a PDF (or a sample) to check how much they differs from each other (or how much they fit). Such tests can be performed, for instance, with ks.test() (a GNU R native procedure): resulting p-values on IDEVAL 1999 are 0.83 and 0.03, respectively. Noticeably, there is a significant statistical evidence to accept the null hypothesis of Test 3. It seems that the ML estimation is capable of correctly fitting a Gamma PDF for fP (given "NetP samples), which double-checks our intuition about the distribution. The same does not hold for fO: in fact, it cannot be

correctly estimated, with a Gamma PDF, from "NetO. The low p-value for Test 4 confirms that the distribution of "NetO delays is completely different than the one of "NetP . Therefore, our criterion doest not only recognize noisy delay-based relationships among alerts stream if they exists; it is also capable of detecting if such a correlation does not hold.


We also tested our technique on alerts generated by our NIDS/HIDS running on IDEVAL 1998 (limiting our analysis to the first four days of the first week), in order to cross-validate the above results. We prepared and processed the data with the same procedures we described above for the 1999 dataset. Starting from almost the same proportion of host/net alerts against either pascal or other hosts, the ML-estimation has computed the two Gamma densities shown in Fig. 5: fP and fO are approximated by Gamma(3.5127, 0.1478) and Gamma(1.3747, 0.0618), respectively (standard errors on estimated parameters are 1.3173, 0.0596 for fP and 0.1265, 0.0068 for fO). These parameter are very similar to the ones we estimated for the IDEVAL 1999 dataset. Furthermore, with p-values of 0.51 and 0.09, the two KS tests confirm the same statistical discrepancies we observed on the 1999 dataset. The above numerical results show that, by interpreting alert streams as random processes, there are several (stochastic) dissimilarities between net-to-host delays belonging to the same net-host attack session, and net-to-host delays belonging to different sessions. Exploiting these dissimilarities, we may find out the correlation among streams in an unsupervised manner, without the need to predefine any parameter.


6 Conclusions


In this paper we analyzed the use of of different types of statistical tests for the correlation of anomaly detection alerts, a problem which has little or no solutions available today. One of the few correlation proposals that can be applied to anomaly detection is the use of a Granger Causality Test (GCT). After discussing a possible testing methodology, we observed that the IDEVAL datasets

traditionally used for evaluation have various shortcomings, that we partially addressed by using the data for a simpler scenario of correlation, investigating only the link between a stream of host-based alerts for a specific host, and the corresponding stream of alerts from a network based detector.


We examined the usage of a GCT as proposed in earlier works, showing that it relies on the choice of non-obvious configuration parameters which significantly affect the final result. We also showed that one of these parameters (the order of the models) is absolutely critical, but cannot be uniquely estimated for a given system. Instead of the GCT, we proposed a simpler statistical model of alert generation, describing alert streams and timestamps as stochastic variables, and showed that statistical tests can be used to create a reasonable criterion for distinguishing correlated and non correlated streams.We proved that our criteria work well on the simplified correlation task we used for testing, without requiring complex configuration parameters.


This is an exploratory work, and further investigations of this approach on real, longer sequences of data, as well as further refinements of the tests and the criteria we proposed, are surely needed. Another possible extension of this work is the investigation of how these criteria can be used to correlate anomaly and misuse-based alerts together, in order to bridge the gap between the existing paradigms of intrusion detection.


Acknowledgments


We need to thank prof. Ilenia Epifani and prof. Giuseppe Serazzi for their helpful comments, as well as the anonymous reviewers.