What to use for real-time log aggregation and querying?
Right now I was looking at HDFS+HBase, which seems like a good solution. Are there any alternatives? Can you recommend anything?
You can check out Flume: https://github.com/cloudera/flume/wiki
You can have a look at Calamaris. In the commercial world, there's Splunk.
If you are trying to parse or collect logs in real time and act on them, my recommendation is the following:
# tail --follow=name --retry /var/log/logfile.log | sendxmpp -i -u username -p password -j somejabberserver.com sendloglineto@somejabberserver.com
That sends each line of the log, as it appears, as an XMPP message to the Jabber user sendloglineto@somejabberserver.com. That user would be connected through a client you write yourself (I prefer Perl and Net::Jabber). You can program the client to do whatever you want with each XMPP message (e.g. store it in a database). If you store it in CouchDB, you can use the _changes API to track updates to that particular database.
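As a rough illustration of the storage step, here is a minimal Python sketch (Python rather than Perl/Net::Jabber, purely for illustration) of what the receiving client might do with each message body. The CouchDB address http://localhost:5984, the database name `logs`, and the helper names `line_to_doc`, `build_store_request` and `changes_url` are all assumptions. The functions only build the requests; nothing is sent.

```python
import json
import time
import urllib.parse
import urllib.request

COUCH_URL = "http://localhost:5984"  # assumed local CouchDB instance
DB_NAME = "logs"                     # hypothetical database name


def line_to_doc(line, source, received_at=None):
    """Wrap one raw log line in a JSON-serialisable document."""
    return {
        "source": source,
        "line": line.rstrip("\n"),
        "received_at": time.time() if received_at is None else received_at,
    }


def build_store_request(doc, base=COUCH_URL, db=DB_NAME):
    """Build (but do not send) a POST /{db} request that creates the document."""
    return urllib.request.Request(
        url=f"{base}/{db}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def changes_url(since="0", base=COUCH_URL, db=DB_NAME):
    """URL of the continuous _changes feed, starting after sequence `since`."""
    query = urllib.parse.urlencode(
        {"feed": "continuous", "include_docs": "true", "since": since}
    )
    return f"{base}/{db}/_changes?{query}"
```

Passing the request to `urllib.request.urlopen` would store the document. A separate consumer could then read the URL from `changes_url()` line by line to see each new log document as it arrives.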
Even though it's an old question, I am posting an answer with the technical stack that is available now:
Data ingestion: Apache Flume, Spark Streaming, Spring XD, or Kafka
Data storage and processing: HBase (raw data in a staging table and aggregated data in final tables, depending on the requirements; row keys can be designed around the ranges you need to search) + SparkOnHBase
Real-time search: HBase with Solr indexes
Reporting (optional): Tableau or Banana (open source)
Overall: Lambda architecture
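To make the Lambda architecture idea a bit more concrete, here is a toy in-memory Python sketch of its three parts: a batch layer that recomputes aggregates from all raw data, a speed layer that counts recent events incrementally, and a serving step that merges both at query time. In the stack above, the batch view would roughly correspond to the aggregated HBase tables and the speed layer to the streaming ingestion side. The log format (timestamp, then level) and all names here are assumptions for illustration, not part of any of those tools.

```python
from collections import Counter


def parse_level(line):
    """Extract the log level, assumed to be the second whitespace-separated field."""
    parts = line.split()
    return parts[1] if len(parts) > 1 else "UNKNOWN"


def batch_view(master_dataset):
    """Batch layer: recompute the full aggregate from all stored raw lines."""
    return Counter(parse_level(line) for line in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally count lines not yet covered by a batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, line):
        self.counts[parse_level(line)] += 1

    def reset(self):
        # Called once a new batch view covers these lines.
        self.counts.clear()


def query(batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return dict(batch + speed.counts)


# Raw lines already stored in the (hypothetical) master dataset.
master = [
    "2015-01-01T00:00:00 INFO service started",
    "2015-01-01T00:00:05 ERROR disk full",
]

# A line that arrived after the last batch run.
speed = SpeedLayer()
speed.ingest("2015-01-01T00:01:00 ERROR disk full")

print(query(batch_view(master), speed))
```

The point of the split is that the batch side can be slow but complete, while the speed side only has to cover the gap since the last batch run, so queries stay close to real time.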
Try Apache Kafka. It should be helpful for your case.