
Anomaly detection using Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.


Closed 7 years ago.


I work for a webhost and my job is to find and clean up hacked accounts. The way I find a good 90% of shells/malware/injections is to look for files that are "out of place." For example, eval(base64_decode(.......)), where "....." is a whole bunch of base64'ed text that is almost never anything good. Odd-looking files jump out at me as I grep through files for key strings.

If these files jump out at me as a human, I'm sure I can build some kind of profiler in Python to look for things that are "out of place" statistically and flag them for manual review. To start off, I thought I could compare the length of lines in PHP files containing key strings (eval, base64_decode, exec, gunzip, gzinflate, fwrite, preg_replace, etc.) and look for lines that deviate from the average by 2 standard deviations.
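A minimal sketch of that line-length idea, assuming plain-text PHP files on disk; the key-string list and the flagging threshold are placeholders from the question, not a tested ruleset:

```python
# Flag lines containing key strings whose length deviates from the file's mean
# line length by more than `threshold` standard deviations.
import statistics

SUSPICIOUS = ("eval", "base64_decode", "exec", "gunzip", "gzinflate",
              "fwrite", "preg_replace")

def flag_outlier_lines(path, threshold=2.0):
    with open(path, errors="ignore") as f:
        lines = f.readlines()
    if len(lines) < 2:
        return []
    lengths = [len(line) for line in lines]
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths)
    flagged = []
    for lineno, line in enumerate(lines, start=1):
        if any(key in line for key in SUSPICIOUS):
            if stdev and abs(len(line) - mean) > threshold * stdev:
                flagged.append((lineno, len(line)))
    return flagged

# Example (hypothetical path):
# print(flag_outlier_lines("public_html/index.php"))
```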

The line length varies widely and I'm not sure if this would be a good statistic to use. Another approach would be to assign weighted rules to certain things (line length over or under a threshold = X points, contains the word "upload" = Y points), but I'm not sure what I can actually do with the scores or how to score each attribute. My statistics are a little rusty.
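To make the weighted-rules idea concrete, a simple additive score per file could look like the sketch below; the rules, point values, and review threshold are made up purely for illustration and would need hand-tuning:

```python
# Hypothetical weighted-rule scorer: each matched rule adds points, and files
# scoring above a hand-tuned threshold are queued for manual review.
def score_file(text, line_length_threshold=500):
    score = 0
    if any(len(line) > line_length_threshold for line in text.splitlines()):
        score += 3  # unusually long line
    if "upload" in text:
        score += 1  # mentions uploads
    if "eval" in text and "base64_decode" in text:
        score += 5  # classic obfuscation pattern
    return score

# Usage (hypothetical file and threshold):
# needs_review = score_file(open("suspect.php", errors="ignore").read()) >= 5
```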

Could anyone point me in the right direction (guides, tutorials, libraries) for statistical profiling?


Here's a simple machine-learning approach to the problem; it's what I'd do to get started and develop a baseline classifier:

Build up a corpus of scripts and attach a label to each, either 'good' (label = 0) or 'bad' (label = 1); the more the better. Try to ensure that the 'bad' scripts are a reasonable fraction of the total corpus; 50-50 good/bad is ideal.

Develop binary features that indicate suspicious or bad scripts, for example the presence of 'eval' or the presence of 'base64_decode'. Be as comprehensive as you can be and don't be afraid of including a feature that might capture some 'good' scripts too. One way to help do this might be to calculate the frequency counts of words in the two classes of script and select as features words that appear prominently in 'bad' but less prominently in 'good'.
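A rough sketch of such a feature generator, where each feature is a 0/1 flag for the presence of a keyword; the keyword list here is only an example and should be extended from the frequency counts mentioned above:

```python
# Binary keyword-presence features for a single script.
KEYWORDS = ["eval", "base64_decode", "gzinflate", "exec", "preg_replace",
            "str_rot13", "fwrite", "create_function"]

def extract_features(script_text):
    return [1 if kw in script_text else 0 for kw in KEYWORDS]
```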

Run the feature generator over the corpus and build up a binary matrix of features with labels.
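Continuing the sketch above, running the generator over a labelled corpus yields a feature matrix X and a label vector y; the corpus structure (a list of path/label pairs) is an assumption for illustration:

```python
# Build X (binary feature matrix) and y (labels) from a labelled corpus.
# corpus: list of (path, label) pairs, where label 0 = good, 1 = bad.
def build_dataset(corpus):
    X, y = [], []
    for path, label in corpus:
        with open(path, errors="ignore") as f:
            X.append(extract_features(f.read()))
        y.append(label)
    return X, y
```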

Split the corpus into a training set (80% of examples) and a test set (20%). Using the scikit-learn library, train a few different classification algorithms (random forests, support vector machines, naive Bayes, etc.) on the training set and test their performance on the unseen test set.
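A minimal scikit-learn sketch of that split-and-train step, assuming X and y come from the dataset built above; the choice of classifiers and the random seed are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 80/20 train/test split of the labelled feature matrix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a few different classifiers and compare accuracy on the held-out set.
for clf in (RandomForestClassifier(), SVC(), BernoulliNB()):
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(type(clf).__name__, accuracy_score(y_test, preds))
```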

Hopefully that gives you a reasonable classification accuracy to benchmark against. I'd then look at improving the features, trying some unsupervised methods (without labels), and using more specialised algorithms to get better performance.

For resources, Andrew Ng's Coursera course on Machine Learning (which includes a spam-classification example, I believe) is a good start.

