Show HN: Logparser – Alternative to GoAccess Written in Python (github.com/lucianmarin)
54 points by lcnmrn on Sept 23, 2021 | 18 comments


I am skeptical of those benchmarks. This is written in Python, and, looking at the core loop, yes, it really is Python, not Python wrapped around C or some other acceleration technology. For pure Python to come out appearing to get four times the throughput of a C program is pretty dubious. That would have to be one crappy C program. GoAccess looks like it ought to be far enough along that somebody has at least taken a bit of a crack at optimization, but perhaps not. C ought to be able to smoke pure Python at this task. (Possibly, you know, unsafely, where a crafted referrer may get to arbitrary code execution or something, but still it ought to be way faster.)


> This is written in Python, and, looking at the core loop, yes, it really is Python, not Python wrapped around C or some other acceleration technology

It seems to use a library, clfparser, to parse Apache common log format logs; internally that uses Python's regex engine, which is written in C.

6000 lines/s seems incredibly slow to me for a C program parsing a log file. I'm seeing a lot of strstr's, strlen's, strdup's, and strchr's in GoAccess's parse.c, all of which are O(n) per line and, while fine in isolation, could be causing GoAccess to do quite a bit more work per line than just using an optimized regex engine.


I wonder what percentage of real-world C programs are exponentially slower than they could be because of the str functions


Thank you. That does sound like something that could resolve my skepticism into concrete facts.


The license is currently just

    All rights reserved.

    Copyright (c) 2020 Lucian Marin

It was the MIT license at the time of the initial commit, and has since been updated to this. So it's not immediately clear whether anyone else can use Logparser - care to clarify, Lucian?


Suggestion: take the time to package this up for PyPI as something people can install using "pip install" (or "pipx install").

This is hard the first time you do it, but worth learning because it's a really great way to distribute your Python software.

I'm giving a talk about how to do this at PyGotham next month, but the notes from that talk are already available and may be useful to you: https://github.com/simonw/pygotham-packaging

You may also find this cookiecutter template that I use to build and package Python CLI apps helpful: https://github.com/simonw/click-app


Is there a way to distribute proprietary software with PyPI? Based on the license text it appears the author wishes to keep it proprietary (maybe source-available, but not open source).


I suppose you actually mean closed source? Because it’s trivial to distribute proprietary code on PyPI: Just say that in your license.

There is no true “closed source” for pure Python programs, but if obfuscation is close enough, you can choose to only deploy wheels containing pre-compiled pyc files. This is good enough for most situations.
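
A rough sketch of the byte-compilation step, using the standard library's `compileall`. The package name `demo_pkg` and the `compile_tree` helper are made up for illustration, and actually building a wheel that ships only the `.pyc` files is a separate packaging step not shown here.

```python
import compileall
import pathlib
import tempfile

def compile_tree(src_dir):
    """Byte-compile every .py file under src_dir; truthy if all files compiled."""
    # legacy=True writes foo.pyc next to foo.py rather than into __pycache__,
    # which is the layout the importer accepts when the .py source is absent.
    return compileall.compile_dir(str(src_dir), legacy=True, quiet=1)

with tempfile.TemporaryDirectory() as tmp:
    pkg = pathlib.Path(tmp) / "demo_pkg"  # hypothetical package
    pkg.mkdir()
    (pkg / "__init__.py").write_text("VALUE = 42\n")
    ok = compile_tree(pkg)
    compiled = sorted(p.name for p in pkg.glob("*.pyc"))
    print(ok, compiled)
```

Keep in mind that `.pyc` files are tied to the Python version that produced them, so such a wheel would have to be built once per supported interpreter version.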


On a tangent, I've been looking into log parsing for an application I'm building recently.

If you want to support pulling info out of common logs it's pretty simple to pull together a list of regexes for the default log format in each major system. Simple example here: https://github.com/multiprocessio/datastation/blob/master/sh....
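
To give a rough idea of what such a list can look like, here is a minimal sketch with two patterns. The regexes, the `PATTERNS` table and the `detect_and_parse` helper are my own illustration, not what the linked file contains, and real deployments often customise their log formats.

```python
import re

# Hypothetical patterns, tried in order: most specific first, because the
# "common" pattern would also match the start of a "combined" line.
PATTERNS = {
    "combined": re.compile(
        r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    ),
    "common": re.compile(
        r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    ),
}

def detect_and_parse(line):
    """Return (format_name, fields) for the first matching pattern, else None."""
    for name, pattern in PATTERNS.items():
        m = pattern.match(line)
        if m:
            return name, m.groupdict()
    return None

common_line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326'
combined_line = common_line + ' "http://example.com/" "Mozilla/5.0"'
print(detect_and_parse(common_line)[0], detect_and_parse(combined_line)[0])
```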

I use this in the app to be able to quickly pull info out of access logs for further analysis a la OP's app and GoAccess but in a GUI where you can also do further processing.

Demo video of this here: https://www.youtube.com/watch?v=sCx2mF2jyUQ&t=9s.


You can find a very comprehensive list of regex patterns by looking at Logstash's grok definitions:

https://github.com/logstash-plugins/logstash-patterns-core/t...


Are you certain your benchmarks are correct? The GoAccess FAQ states that it parses over 100,000 lines/second [1]. While this figure depends on the hardware used, this still is massively faster than the figure quoted in the README. Benchmarking is quite technical if you want consistent results, so some more information on the benchmarking methodology used here would be much appreciated.
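
For what it's worth, a minimal sketch of the kind of measurement I mean: time only the parse loop, repeat it, and report the best run. The `lines_per_second` helper and the synthetic workload are assumptions for illustration (`str.split` just stands in for a real parser); this is not how either project's numbers were produced.

```python
import time

def lines_per_second(parse, lines, repeats=5):
    """Return the best observed throughput of parse() over the given lines."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            parse(line)
        best = min(best, time.perf_counter() - start)
    return len(lines) / best

# Hypothetical workload: identical synthetic lines held in memory, so disk
# I/O is excluded from the timing.
sample = ['127.0.0.1 - - [23/Sep/2021:10:00:00 +0000] "GET / HTTP/1.1" 200 512'] * 10_000
rate = lines_per_second(str.split, sample)
print(f"{rate:,.0f} lines/s")
```

A fair comparison would also need the same log file, the same hardware, and a note on whether file reading and report generation are included.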

[1] https://goaccess.io/faq#performance


To be fair, GoAccess does a bit more (it has that websockets live view)


That's not in the parse loop - where comparison is happening.


Still, there's a lot more data outputting from goaccess with support for custom logs.


I'm not sure it's an alternative yet; functionally it seems to be missing incremental parsing, live updates, interactive HTML and TUI interfaces, graphs,...


Seems like a confusing name given that logparser for IIS log files has been around for a very long time.


IMO this could benefit from using `collections.Counter` instead of `defaultdict(set)`.


With a `Counter` you would be counting each access from a given IP as a hit against a category, rather than counting the IP itself.

Currently if 4 clients hit one URL and 1 client hits 5, logparser will register 5 records in each category (unless they're classified as bots for browsers and systems). With a Counter, it'd be 9.

Both pieces of information could be accessible using a `defaultdict(Counter)`, but I don't know how useful that would be to the people actually using logparser.
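
A small sketch of the difference, reading the example above as one client making five requests to the same URL. The `(ip, url)` records and variable names are made up for illustration and are not taken from logparser's code.

```python
from collections import Counter, defaultdict

# Hypothetical access records: 4 clients hit /a once, 1 client hits it 5 times.
events = [(f"10.0.0.{i}", "/a") for i in range(1, 5)] + [("10.0.0.5", "/a")] * 5

unique_ips = defaultdict(set)      # url -> set of distinct IPs
hits = Counter()                   # url -> total number of requests
hits_by_ip = defaultdict(Counter)  # url -> Counter of requests per IP

for ip, url in events:
    unique_ips[url].add(ip)
    hits[url] += 1
    hits_by_ip[url][ip] += 1

print(len(unique_ips["/a"]))         # 5 distinct clients
print(hits["/a"])                    # 9 requests in total
print(len(hits_by_ip["/a"]))         # 5, same as the set-based count
print(sum(hits_by_ip["/a"].values()))  # 9, same as the plain Counter
```

The `defaultdict(Counter)` carries both numbers at the cost of one integer per IP per key.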



