I am skeptical of those benchmarks. This is written in Python, and, looking at the core loop, yes, it really is Python, not Python wrapped around C or some other acceleration technology. For pure Python to come out appearing to get four times the throughput of a C program is pretty dubious. That would have to be one crappy C program. GoAccess looks like it ought to be far enough along that somebody has at least taken a bit of a crack at optimization, but perhaps not. C ought to be able to smoke pure Python at this task. (Possibly, you know, unsafely, where a crafted referrer may get to arbitrary code execution or something, but still it ought to be way faster.)
> This is written in Python, and, looking at the core loop, yes, it really is Python, not Python wrapped around C or some other acceleration technology
It seems to use a library clfparser to parse apache common log format logs; internally that uses Python's regex engine which is written in C.
6000 lines/s seems incredibly slow to me for a C program parsing a log file. I'm seeing a lot of strstr's, strlen's, strdup's, and strchr's in GoAccess's parse.c, all of which are O(n) per line and, while fine in isolation, could be causing GoAccess to do quite a bit more work per line than just using an optimized regex engine.
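To make the regex point concrete, here is a rough sketch of parsing one Common Log Format line with a single compiled pattern, so the scanning happens inside the C regex engine in one pass. The pattern, the `parse_line` helper, and the sample line are my own illustration, not clfparser's or GoAccess's actual code.

```python
import re

# Hypothetical pattern for the Apache Common Log Format.
# It is an illustration only, not the pattern clfparser uses.
CLF_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_RE.match(line)
    return match.groupdict() if match else None


# Made-up sample line in Common Log Format.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
fields = parse_line(sample)
```

Whether this beats a hand-rolled chain of strstr/strchr calls depends on the input and the implementation, so treat it as a shape to measure rather than a claim about either project.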
All rights reserved.
Copyright (c) 2020 Lucian Marin
It was the MIT license at the time of the initial commit, and has since been updated to this. So it's not immediately clear if anyone else can necessarily use Logparser - care to clarify, Lucian?
Suggestion: take the time to package this up for PyPI as something people can install using "pip install" (or "pipx install").
This is hard the first time you do it, but worth learning because it's a really great way to distribute your Python software.
I'm giving a talk about how to do this at PyGotham next month, but the notes from that talk are already available and may be useful to you: https://github.com/simonw/pygotham-packaging
Is there a way to distribute proprietary software with PyPI? Based on the license text it appears the author wishes to keep it proprietary (maybe source-available, but not open source).
I suppose you actually mean closed source? Because it’s trivial to distribute proprietary code on PyPI: Just say that in your license.
There is no true “closed source” for pure Python programs, but if obfuscation is close enough, you can choose to only deploy wheels containing pre-compiled pyc files. This is good enough for most situations.
I use this in the app to be able to quickly pull info out of access logs for further analysis a la OP's app and GoAccess but in a GUI where you can also do further processing.
Are you certain your benchmarks are correct? The GoAccess FAQ states that it parses over 100,000 lines/second [1]. While this figure depends on the hardware used, this still is massively faster than the figure quoted in the README. Benchmarking is quite technical if you want consistent results, so some more information on the benchmarking methodology used here would be much appreciated.
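For what it's worth, a minimal way to get a more repeatable lines/second number is to time several passes over the same in-memory lines and report the best one. This is only a sketch: `lines_per_second` and its parameters are names I made up, and it measures a Python callable, not GoAccess or Logparser as shipped.

```python
import time


def lines_per_second(parse, lines, repeats=5):
    """Best-of-N throughput of parse() over lines, in lines per second."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            parse(line)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on a very small input.
    return len(lines) / max(best, 1e-9)


# Hypothetical usage: str.split stands in for a real log-line parser.
rate = lines_per_second(str.split, ["a b c"] * 10000)
```

Holding the lines in memory keeps disk I/O out of the measurement, which is a deliberate choice here; a full comparison would also need the same log file, hardware, and output options for both tools.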
I'm not sure it's an alternative yet; functionally it seems to be missing incremental parsing, live updates, interactive HTML and TUI interfaces, graphs, ...
With a `Counter` you would be counting each access from a given IP as a hit against a category, rather than counting the IP itself.
Currently, if 4 clients hit one URL and 1 client hits 5, logparser will register 5 records in each category (unless they're classified as bots for browsers and systems). With a Counter, it'd be 9.
Both pieces of information could be made accessible using a `defaultdict(Counter)`, but I don't know how useful that would be to the people actually using logparser.
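A small sketch of the difference, using made-up data: four clients with one request each and one client with five requests. The set per category is just my stand-in for counting unique IPs; I'm not claiming it is how logparser stores things internally.

```python
from collections import Counter, defaultdict

# Hypothetical (ip, browser) pairs: four clients with one request each,
# plus one client that makes five requests.
requests = [("10.0.0.%d" % i, "Firefox") for i in range(1, 5)]
requests += [("10.0.0.5", "Firefox")] * 5

unique_ips = defaultdict(set)   # category -> distinct IPs
hits = Counter()                # category -> raw request count
per_ip = defaultdict(Counter)   # category -> IP -> request count

for ip, browser in requests:
    unique_ips[browser].add(ip)
    hits[browser] += 1
    per_ip[browser][ip] += 1

unique_count = len(unique_ips["Firefox"])  # distinct clients
hit_count = hits["Firefox"]                # total requests
```

Here `unique_count` is 5 and `hit_count` is 9, while `per_ip` keeps both: its length per category gives the unique clients and the sum of its values gives the hits.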