14 Oct 2020
Recently, I wrote a Python script that allows you to collect and extract data from deeply nested structures without writing boilerplate loops and list/dictionary access code. The idea is similar to regular expressions, where you specify a pattern to match, except that in this case you specify a pattern to collect nested data.
Let’s go through a simple example to understand how this works. Given the following data (some arbitrary JSON-like data):
[
{
"error1": {
"type": "Runtime Error",
"occurrence": [
{"line": 10, "message": "fail"},
{"line": 20, "message": "block"},
],
},
"error2": {
"type": "Compiler Error",
"occurrence": [
{"line": 50, "message": "fail"},
{"line": 64, "message": "xyz"},
{"line": 70, "message": "pqr"},
],
},
"1error": {
"type": "Runtime Error",
"occurrence": [
{"line": 100, "message": "fail"},
{"line": 200, "message": "block"},
],
},
"2error": {
"type": "Compiler Error",
"occurrence": [
{"line": 500, "message": "fail"},
{"line": 640, "message": "xyz"},
{"line": 700, "message": "pqr"},
],
},
},
{
"error": {
"type": "Brain malfunctioned",
"occurrence": [
{"line": 150, "message": "abort!"},
{"line": 23, "message": "shutdown"},
],
},
"error": {
"type": "Computer crashed",
"occurrence": [
{"line": 341, "message": "blocked"},
{"line": 4, "message": "blocked"},
{"line": 74, "message": "math error"},
],
},
}
]
Note that the second dictionary defines the key error twice. When Python evaluates this literal, the later value replaces the earlier one, which is why “Brain malfunctioned” and its line numbers never show up in the outputs that follow.
Let’s say we only want to collect the line numbers. You specify the following pattern, and the collector gives you all the line numbers in a list.
# Pattern
pattern = "_all_, _all_, occurrence, _all_, line"
# Output
[10, 20, 50, 64, 70, 500, 640, 700, 100, 200, 341, 4, 74]
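For comparison, here is roughly what the hand-written version of that pattern looks like. This is only a sketch: the data is a trimmed-down sample with the same shape as the data above, and the variable names are made up for illustration.

```python
# Trimmed-down sample with the same shape as the data above (assumption: shortened for brevity).
data = [
    {
        "error1": {
            "type": "Runtime Error",
            "occurrence": [
                {"line": 10, "message": "fail"},
                {"line": 20, "message": "block"},
            ],
        },
        "error2": {
            "type": "Compiler Error",
            "occurrence": [{"line": 50, "message": "fail"}],
        },
    },
    {
        "error": {
            "type": "Computer crashed",
            "occurrence": [{"line": 341, "message": "blocked"}],
        },
    },
]

lines = []
for group in data:                               # first _all_: each dictionary in the root list
    for error in group.values():                 # second _all_: each value, keys ignored
        for occurrence in error["occurrence"]:   # occurrence, then the third _all_
            lines.append(occurrence["line"])     # line

print(lines)  # [10, 20, 50, 341]
```

Each nesting level costs one more loop, and that is exactly the boilerplate the one-line pattern replaces.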
The first thing you notice is that the pattern is a string of comma-separated items, and that it contains a mysterious _all_.
What’s _all_? By putting _all_ in the pattern you’re essentially asking the collector to loop through every item of a container (be it a list or a dictionary). The first _all_ loops through the first level of items (two dictionaries in this case) inside the root list. The second _all_ loops through the values of those dictionaries (ignoring the dictionary keys). Next, we ask for the occurrence items, which are present inside each of the dictionaries at the third level. Notice that occurrence is itself yet another list, so we loop through _all_ the items inside it. And lastly, we collect line. Done!
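To make the walk-through concrete, here is a minimal sketch of how such a collector could work. It is not the actual script: the function name collect and its behaviour (silently skipping anything that doesn’t match) are assumptions made for illustration. It walks dictionaries in insertion order, so results may come out in a different order than the listings in this post.

```python
def collect(data, pattern):
    """Hypothetical collector: follow a comma-separated pattern through nested data."""
    steps = [step.strip() for step in pattern.split(",")]
    current = [data]  # the nodes reached so far, starting at the root
    for step in steps:
        next_level = []
        for node in current:
            if step == "_all_":
                # Loop through every item: dictionary values or list elements.
                if isinstance(node, dict):
                    next_level.extend(node.values())
                elif isinstance(node, list):
                    next_level.extend(node)
            elif isinstance(node, dict) and step in node:
                # A plain name selects that key; nodes without it are skipped.
                next_level.append(node[step])
        current = next_level
    return current


# Trimmed-down sample with the same shape as the data above.
sample = [
    {
        "error1": {"type": "Runtime Error",
                   "occurrence": [{"line": 10, "message": "fail"},
                                  {"line": 20, "message": "block"}]},
        "error2": {"type": "Compiler Error",
                   "occurrence": [{"line": 50, "message": "fail"}]},
    },
    {
        "error": {"type": "Computer crashed",
                  "occurrence": [{"line": 341, "message": "blocked"}]},
    },
]

print(collect(sample, "_all_, _all_, occurrence, _all_, line"))  # [10, 20, 50, 341]
```

Every step of the pattern turns the current list of nodes into the next one, so the nesting depth of the data never shows up as nested loops in your own code.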
Let’s see a few more patterns and their outputs:
pattern_1 = "_all_, _all_, type"
# Output of pattern 1
['Runtime Error', 'Compiler Error', 'Compiler Error', 'Runtime Error', 'Computer crashed']
pattern_2 = "_all_, *error, type"
# Output of pattern 2
['Compiler Error', 'Runtime Error', 'Computer crashed']
pattern_3 = "_all_, error*, occurrence, _all_, line"
# Output of pattern 3
[10, 20, 50, 64, 70, 341, 4, 74]
Notice that you can match dictionary keys by partial name by putting * before or after a name. In pattern 2, we wanted all keys ending with the word “error”, so the pattern is *error. In pattern 3, we wanted all keys starting with the word “error”, so the pattern is error*.
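One way to get this kind of partial matching is Python’s standard fnmatch module. The sketch below is an assumption about how it could be done, not the script’s actual code, and the function name collect is hypothetical. Keep in mind that fnmatchcase also treats ? and [...] as special characters, which may or may not be what you want for key names.

```python
from fnmatch import fnmatchcase


def collect(data, pattern):
    """Hypothetical collector with * wildcards for dictionary keys."""
    steps = [step.strip() for step in pattern.split(",")]
    current = [data]
    for step in steps:
        next_level = []
        for node in current:
            if isinstance(node, list):
                # Lists have no keys, so only _all_ can descend into them.
                if step == "_all_":
                    next_level.extend(node)
            elif isinstance(node, dict):
                for key, value in node.items():
                    # _all_ takes every value; otherwise the key must match the step,
                    # where * stands for any run of characters (including none).
                    if step == "_all_" or (isinstance(key, str) and fnmatchcase(key, step)):
                        next_level.append(value)
        current = next_level
    return current


# Small sample for illustration only.
sample = [
    {"error1": {"type": "Runtime Error"}, "1error": {"type": "Compiler Error"}},
    {"error": {"type": "Computer crashed"}},
]

print(collect(sample, "_all_, *error, type"))  # ['Compiler Error', 'Computer crashed']
print(collect(sample, "_all_, error*, type"))  # ['Runtime Error', 'Computer crashed']
```

The key error matches both patterns because * is allowed to match nothing at all.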
[insert github link here later]