Candidate: Stephen Hawkes hawkz Assessed by: Nicholas Tollervey ntoll

Python (2023) ~ Grade 3 (Lower)

Regular Postcodes

Postcodes are great: they're a short way to figure out where someone is, and they're useful in lots of contexts. Often someone needs to work out which region their postcode falls into. E.g. I know my postcode, but I don't know my NHS CCG area or my electoral ward. This is usually solved with a database that stores every postcode in a table alongside the associated values. Brilliant. Except it sucks for small use cases or for static web projects. For this grade I was curious to see whether I could convert the list of postcodes into a regular expression (or something similar), so that a static process could determine the region without a database lookup.

Attached files
Filename (click to download) Size Uploaded by
pcodey60.zip 1.0 MB hawkz
Markdown code
[pcodey60.zip](/media/assessment/86a9f52f/2022-03-31/09-19-49/pcodey60.zip){target="_blank"}
run.py 709 bytes hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]

print(results.keys())
print(results)


# outward = {}
# with open('all.csv') as f:
#     for line in f:
#         code = line.strip().split(" ")
#         if code[0] not in outward:
#             outward[code[0]] = [code[1]]
#         else:
#             outward[code[0]].append(code[1])

# print(outward.keys())
```
all.csv 6.2 MB hawkz
Markdown code
[all.csv](/media/assessment/86a9f52f/2022-03-31/19-58-11/all.csv){target="_blank"}
all.csv 6.2 MB hawkz
Markdown code
[all.csv](/media/assessment/86a9f52f/2022-04-11/08-00-18/all.csv){target="_blank"}
run5.py 2.3 KB hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv
import re

testfile = 'all.csv'
results = {}
size = 0
answers = []
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        answers.append(answer)

        for a in range(1,len(p)+1):
            if p[0:a] in results:
                if answer not in results[p[0:a]]:
                    results[p[0:a]].append(answer)
            else:
                results[p[0:a]] = [answer]

remove = []
for x in set(results):
    if x[:-1] in results:
        if len(results[x[:-1]]) == 1:
            remove.append(x)

for x in set(remove):
    del results[x]

# print('%s total results' % size)
# print('%s new total results' % len(results))

def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
    output = ''
    for num, t in enumerate(all): 
        if num: output += '|'
        if indent == 1: output += '^'
        output += t[-1]
        if indent <= 7:
            newindent = indent + 1
            children = [x for x in results if a in results[x] and len(x) == newindent and x.startswith(t)]
            if children:
                if len(children) > 1: output += '(?:'
                output += op(a, t, newindent)
                if len(children) > 1: output += ')'
            else:
                output += ''
    return output

formula = {}
for answer in set(answers):
    formula[answer] = op(answer, '', 1)
    print(answer, formula[answer])


# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in formula:
#             m = re.search(formula[test], postcode)
#             if m:
#                 results[m.end()] = test

#         print(postcode, answer, results)
#         if results.keys() and results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1

# print(total, win, lose, (win/total)*100)

```
run4.py 5.2 KB hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
import json
import pprint
from collections import defaultdict

testfile = 'small2.csv'

def tree(tree):
    print(json.dumps(dict(tree), indent=3, sort_keys=True))

def merge(a, b, path=None):
    "merges b into a"
    if path is None: path = []
    for key in b:
        if key in a:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                merge(a[key], b[key], path + [str(key)])
            elif a[key] == b[key]:
                pass # same leaf value
            else:
                raise Exception('Conflict at %s' % '.'.join(path + [str(key)]))
        else:
            a[key] = b[key]
    return a

def count_keys(dict_, counter=0):
    for each_key in dict_:
        if isinstance(dict_[each_key], dict):
            # Recursive call
            counter = count_keys(dict_[each_key], counter + 1)
        else:
            counter += 1
    return counter

results = defaultdict(dict)
size = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        # Build a nested dict with one level per character, e.g. 'B1' -> {'B': {'1': {}}},
        # without passing file contents through eval().
        postdict = {}
        for char in reversed(p):
            postdict = {char: postdict}
        results[answer] = merge(results[answer], postdict)

# pprint.pprint(results)
print('%s rows in the sheet' % size)
print('%s keys in tree' % count_keys(results))

tree(results)

def recursive_get(d, keys):
    if len(keys) == 1:
        return d[keys[0]]
    return recursive_get(d[keys[0]], keys[1:])

for answer in results:
    for first in results[answer]:
        # C
        unique = True
        for other in [x for x in results if answer != x]:
            if first in results[other]:
                # another region shares this first character, so it is not unique
                unique = False
        # decide only after every other region has been checked
        if unique:
            results[answer][first] = {}

        
            # else:
            #     # not unique -> children
            #     for second in results[answer][first]:
            #         if not [y for y in otheranswers if second in results[y][first]]:
            #             # unique
            #             results[answer][first][second] = {}
            #         else:
            #             # not unique -> children
            #             for third in  results[answer][first][second]:
            #                 if not [y for y in otheranswers if third in results[y][first][second]]:
            #                     # unique
            #                     results[answer][first][second][third] = {}
            #                 else:
            #                     # not unique -> children
            #                     for fourth in results[answer][first][second][third]:
            #                         if not [y for y in otheranswers if fourth in results[y][first][second][third]]:
            #                             # unique
            #                             results[answer][first][second][third][fourth] = {}
            #                         else:
            #                             # not unique -> children
            #                             for fifth in results[answer][first][second][third][fourth]:
            #                                 if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth]]:
            #                                     # unique
            #                                     results[answer][first][second][third][fourth][fifth] = {}
            #                                 else:
            #                                     # not unique -> children
            #                                     for sixth in results[answer][first][second][third][fourth][fifth]:
            #                                         if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth][fifth]]:
            #                                             # unique
            #                                             results[answer][first][second][third][fourth][fifth][sixth] = {}


print('%s keys in tree' % count_keys(results))
tree(results)

# ##########

# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m: 
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#             # print(postcode, answer, results[sorted(results.keys())[-1]])

# print(total, win, lose, (win/total)*100)

# def walk_dict(d):
#     text = ''
#     for k,v in d.items():
#         if isinstance(v, dict):
#             text += k + "("
#             text += walk_dict(v)
#             text += ')'
#         else:
#             text += "%s %s" % (k, v)
#     return text

# for answer in results:
#     print (answer + ':  ', walk_dict(results[answer]))
```
run3.py 3.1 KB hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint


testfile = 'small2.csv'

results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        
        for num, l in enumerate(postcode):
            if num in results[answer]:
                 results[answer][num].append(postcode[0:num+1])
            else:
                results[answer][num] = [postcode[0:num+1]]

original = 0
for answer in results:
    for x in results[answer]:
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])

        for part in results[answer][x]:
            unique = True
            for anotheranswer in results:
                if answer != anotheranswer: 
                    if x in results[anotheranswer]:
                        if part in results[anotheranswer][x]:
                            unique = False
                            continue
                else:
                    continue
            if unique:
                for z in range(x+1, len(results[answer])):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))

print("%s combinations" % original)

print(results)


# items = 0
# for answer in results:
#     for x in range(4):
#         items += len(results[answer][x])

# print("reduced to %s combinations, %s percent reduction" % (items, items/original ))

def recursechildren(answer, counter, part):
    if counter in results[answer]:
        for num, id in enumerate(sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter]))))):
            print(results[answer][counter][num].replace(part, ''), end = '')
            if counter+1 in results[answer]:
                print('(', end = '')
                for childpart in sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter+1])))):
                    recursechildren(answer, counter+1, childpart)
                print(')', end = '')
            print('|', end = '')


out = {}
for answer in results:
    print(answer, '>>>')
    for num, id in enumerate(results[answer]):
        recursechildren(answer, num, part)


# for r in out:
#     print (r, '>>', out[r])

# ##########

# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m: 
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#             # print(postcode, answer, results[sorted(results.keys())[-1]])

# print(total, win, lose, (win/total)*100)


```
run2.py 3.6 KB hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint


testfile = 'all.csv'

results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0]
        answer = row[1]
        # raw string so that \d is passed to the regex engine unchanged
        pc_parts = re.split(r'(\d+)', postcode)
        pc_parts.remove(' ')
        if answer not in results:
            results[answer] = {0: [], 1: [], 2: [], 3: []}
        for a in results[answer].keys():
            results[answer][a].append(''.join(pc_parts[0:a+1]))

original = 0
for answer in results:
    for x in range(4):
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])

print("%s combinations" % original)

for y in range(3):
    for answer in results:
        for part in results[answer][y]:
            unique = True
            for anotheranswer in results:
                if answer != anotheranswer: 
                    if part in results[anotheranswer][y]:
                        unique = False
                        continue
                else:
                    continue
            if unique:
                for z in range(y+1, 4):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))


items = 0
for answer in results:
    for x in range(4):
        items += len(results[answer][x])

# report the reduction as a percentage of the original count, not as a raw ratio
print("reduced to %s combinations, %.1f percent reduction" % (items, 100 * (1 - items / original)))

out = {}
for answer in results:
    out[answer] = ''
    for number, part in enumerate(results[answer][0]):
        if number: out[answer] += '|'
        out[answer] += '^' + part

        if len(results[answer][1]) > 0:
            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += '(?:'

            for number1, part1 in enumerate([x for x in results[answer][1] if x.startswith(part)]):
                if number1: out[answer] += '|'
                out[answer] += part1.replace(part, '')
                
                if len([x for x in results[answer][2] if x.startswith(part1)]) > 0:
                    out[answer] += "(?:"
                    for number2, part2 in enumerate([x for x in results[answer][2] if x.startswith(part1)]):
                        if number2: out[answer] += '|'
                        out[answer] += part2.replace(part1, '')


                        if len([x for x in results[answer][3] if x.startswith(part2)]):
                            endings = '|'.join([x[-2:] for x in results[answer][3] if x.startswith(part2)])
                            if len(endings):
                                if number2: out[answer] += '|'
                                out[answer] += '(?:' + endings + ')'

                    out[answer] += ")"
                
            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += ')'


for r in out:
    print (r, '>>', out[r])

##########

win = 0
lose = 0
total = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        total += 1

        results = {}
        for test in out:
            m = re.search('^('+out[test]+")", postcode)
            if m: 
                end = m.end()
                results[end] = test
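        # pick the region whose expression matched the most characters;
        # a postcode that matched nothing counts as a loss instead of raising IndexError
        if results and results[max(results)] == answer: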
            win +=1
        else:
            lose +=1
            # print(postcode, answer, results[sorted(results.keys())[-1]])

print(total, win, lose, (win/total)*100)


```
run.py 3.6 KB hawkz
Markdown code
```Python
#!/usr/bin/env python
import csv

# results = {}
# with open('all.csv', newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         if row[1] in results:
#             if row[0].split(" ")[0] not in results[row[1]]:
#                 results[row[1]].append(row[0].split(" ")[0])
#         else:
#             results[row[1]] = [row[0].split(" ")[0]]

# incodes = []
# dupes = []
# for outcode in results:
#     for incode in results[outcode]:
#         if incode not in incodes:
#             incodes.append(incode)
#         for othercode in results:
#             if othercode != outcode:
#                 if incode in results[othercode]:
#                     if incode not in dupes:
#                         dupes.append(incode)

# print("%s outcodes are in more than one region" % len(dupes))
# print("%s are unique to regions" % (len(incodes) - len(dupes)))
# print("%s all the outcodes" % len(incodes))

# print(results.keys())


results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        # Group each region's sectors (outcode + first incode character) by outcode.
        # The original test looked the sector up among the outcode keys, so it never
        # matched and each outcode's list was overwritten on every row.
        parts = row[0].split(" ")
        outcode = parts[0]
        sector = outcode + parts[1][0]
        sectors = results.setdefault(row[1], {}).setdefault(outcode, [])
        if sector not in sectors:
            sectors.append(sector)
# print(results)

# Find the unique outcodes
for region in results:
    results[region]['values'] = []
    for outcode in results[region]:
        unique = True
        for otherregion in results:
            if outcode in results[otherregion] and region != otherregion:
                unique = False
        if unique:
            results[region]['values'].append(outcode)

# remove uniques we've found
for region in results:
    for outcode in results[region]['values']:
        results[region].pop(outcode, None)

# find the uniques at incode level
for region in results:
    for outcode in list(results[region]):
        unique = True
        if outcode != 'values':
            for otherregion in results:
                for otheroutcode in results[otherregion]:
                    if outcode in results[otherregion][otheroutcode]:
                        unique = False
        if unique:
            if 'values' in results[region]:
                results[region]['values'].append(results[region][outcode][0])
                # results[region].pop(outcode, None)
            else: 
                results[region]['values'] = [results[region][outcode][0]]
                # results[region].pop(outcode, None)


shortlist = {}
for region in results:
    # group the alternatives so that '^' anchors every one of them, not just the first
    shortlist[region] = '^(?:' + '|'.join(set(results[region]['values'])) + ')'
print(shortlist)


import re

# verify
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    win, lose = 0, 0
    for row in reader:
        pcode =  row[0].replace(' ','')
        result = ''
        for test in shortlist:
            val = re.search(shortlist[test], pcode)

            if val:
                # print(val, test, shortlist[test],  pcode)
                result = test
                break
        
        # print(pcode, row[1], result, row[1] == result)
        if row[1] == result:
            win += 1
        else: 
            lose += 1
            print(pcode)
        
# print(win, lose)

```
brum_postcode.png 58.8 KB ntoll
Markdown code
![brum_postcode.png](/media/assessment/86a9f52f/2022-11-27/20-40-22/brum_postcode.png "brum_postcode.png")

Status: Submitted for assessment.


hawkz Stephen Hawkes ~ 31 Mar 2022 9:21 a.m.

pcodey60.zip I have the NHS database of CCG areas. I wonder if I could change this from a huge CSV into a smaller way to look up the data, perhaps a Python dictionary.

E.g. {"^B(.*)": "15E" } This would mean any postcode starting with B would give the 15E (Birmingham and Solihull CCG) region ID.
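A minimal sketch of that idea, assuming a hypothetical `REGION_PATTERNS` dictionary and `lookup_region` helper (neither is in the attached files). The single pattern is the illustrative one above, not real CCG data:

```python
import re

# Hypothetical pattern table: regex -> CCG region ID (illustrative only).
REGION_PATTERNS = {
    r"^B(.*)": "15E",
}


def lookup_region(postcode):
    """Return the region ID of the first pattern that matches, else None."""
    compact = postcode.replace(" ", "").upper()
    for pattern, region in REGION_PATTERNS.items():
        if re.search(pattern, compact):
            return region
    return None


print(lookup_region("B13 8AA"))
print(lookup_region("PO1 3AX"))
```

With only one pattern in the table, anything that does not start with B simply returns `None`.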

Wish me luck! Or enjoy the madness that follows.


hawkz Stephen Hawkes ~ 31 Mar 2022 7:25 p.m.

Each postcode consists of two parts. The first part is the outward postcode, or outcode. This is separated by a single space from the second part, which is the inward postcode, or in-code.

The outward postcode enables mail to be sent to the correct local area for delivery. This part of the code contains the area and the district to which the mail is to be delivered.

The inward postcode is used to sort the mail at the local area delivery office. It consists of a numeric character followed by two alphabetic characters. The numeric character identifies the sector within the postal district. The alphabetic characters then define one or more properties within the sector.

For example, PO1 3AX:

- PO refers to the postcode area of Portsmouth.
- PO1 refers to a postcode district within the postcode area of Portsmouth.
- PO1 3 refers to the postcode sector.
- PO1 3AX: the AX completes the postcode. The last two letters define the ‘unit postcode’, which identifies one or more small user delivery points or an individual large user.
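The breakdown above can be sketched in a few lines. `split_postcode` is a hypothetical helper, and it assumes the postcode is written with a single space between the outcode and the incode:

```python
import re


def split_postcode(postcode):
    """Split a space-separated UK postcode into its named parts."""
    outcode, incode = postcode.upper().split()
    # The area is the leading run of letters in the outcode.
    area = re.match(r"[A-Z]+", outcode).group()
    return {
        "area": area,                         # e.g. PO
        "district": outcode,                  # e.g. PO1
        "sector": f"{outcode} {incode[0]}",   # e.g. PO1 3
        "unit": incode[1:],                   # e.g. AX
    }


parts = split_postcode("PO1 3AX")
print(parts)
```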


hawkz Stephen Hawkes ~ 31 Mar 2022 8:04 p.m.

Here's a quick demo I built to try my theory out. It parses the CSV file containing all the UK postcodes and their NHS CCG regions, then prints the keys and the whole dict so I can see what reduction there is, if any.

#!/usr/bin/env python
import csv

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]

print(results.keys())
print(results)

So, as a quick test, we can make a big saving from just the outward part of the postcode for the CCG regions. B13 only appears in the 15E region, so all B13 postcodes are in Birmingham and Solihull CCG. However, B14 appears in more than one region, so next I need to measure how unique the outcodes are. If every outcode had mapped to a single region, I could have reduced a 6M CSV to a 5K text file.

all.csv


hawkz Stephen Hawkes ~ 31 Mar 2022 8:58 p.m. (updated: 31 Mar 2022 9:27 p.m.)

#!/usr/bin/env python
import csv

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]

incodes = []
dupes = []
for outcode in results:
    for incode in results[outcode]:
        if incode not in incodes:
            incodes.append(incode)
        for othercode in results:
            if othercode != outcode:
                if incode in results[othercode]:
                    if incode not in dupes:
                        dupes.append(incode)

print("%s outcodes are in more than one region" % len(dupes))
print("%s are unique to regions" % (len(incodes) - len(dupes)))
print("%s all the outcodes" % len(incodes))

Here I've found the following:

- 150 outcodes are in more than one region
- 272 are unique to regions
- 422 outcodes in total

So 64% of outcodes (272 of 422) identify their region on their own.
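The same count can be sketched with sets, which avoids the repeated list scans. `count_shared_outcodes` is a hypothetical helper, and the rows (including the region label `OTHER`) are made-up sample data, not values from all.csv:

```python
from collections import defaultdict


def count_shared_outcodes(rows):
    """Return (shared, unique, total) outcode counts for (postcode, region) rows."""
    regions_by_outcode = defaultdict(set)
    for postcode, region in rows:
        outcode = postcode.split(" ")[0]
        regions_by_outcode[outcode].add(region)
    # An outcode is shared when more than one region uses it.
    shared = sum(1 for regions in regions_by_outcode.values() if len(regions) > 1)
    total = len(regions_by_outcode)
    return shared, total - shared, total


# Made-up sample rows for illustration only.
sample = [
    ("B13 8AA", "15E"),
    ("B13 9ZZ", "15E"),
    ("B14 5AA", "15E"),
    ("B14 7BB", "OTHER"),
]
shared, unique, total = count_shared_outcodes(sample)
print(shared, unique, total)
```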


hawkz Stephen Hawkes ~ 11 Apr 2022 8:02 a.m.

After exploring further I was able to create a working algorithm to compress the postcodes into a regular expression:

#!/usr/bin/env python
import csv
import re

testfile = 'all.csv'
results = {}
size = 0
answers = []
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        answers.append(answer)

        for a in range(1,len(p)+1):
            if p[0:a] in results:
                if answer not in results[p[0:a]]:
                    results[p[0:a]].append(answer)
            else:
                results[p[0:a]] = [answer]

remove = []
for x in set(results):
    if x[:-1] in results:
        if len(results[x[:-1]]) == 1:
            remove.append(x)

for x in set(remove):
    del results[x]

def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
    output = ''
    for num, t in enumerate(all): 
        if num: output += '|'
        if indent == 1: output += '^'
        output += t[-1]
        if indent <= 7:
            newindent = indent + 1
            children = [x for x in results if a in results[x] and len(x) == newindent and x.startswith(t)]
            if children:
                if len(children) > 1: output += '(?:'
                output += op(a, t, newindent)
                if len(children) > 1: output += ')'
    return output

formula = {}
for answer in set(answers):
    formula[answer] = op(answer, '', 1)
    print(answer, formula[answer])


win = 0
lose = 0
total = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        total += 1

        results = {}
        for test in formula:
            m = re.search(formula[test], postcode)
            if m:
                results[m.end()] = test

        if results.keys() and results[sorted(results.keys())[-1]] == answer:
            win +=1
        else:
            lose +=1

print(total, win, lose, (win/total)*100)

With the process above I can compress a 6.6MB file into 67KB of regular expressions, with 100% of the codes being tested. all.csv
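The core of the process (map every prefix to the regions that use it, then stop descending once a prefix is unambiguous) can be sketched on a few made-up rows. This is a simplified illustration, not the script above: `unambiguous_prefixes` is a hypothetical helper that returns only the shortest region-identifying prefixes and does not build the regular expressions.

```python
def unambiguous_prefixes(rows):
    """Map each shortest region-identifying prefix to its region."""
    regions_by_prefix = {}
    for postcode, region in rows:
        compact = postcode.replace(" ", "")
        for end in range(1, len(compact) + 1):
            regions_by_prefix.setdefault(compact[:end], set()).add(region)

    shortest = {}
    for prefix, regions in regions_by_prefix.items():
        if len(regions) != 1:
            continue  # still ambiguous at this length
        parent = prefix[:-1]
        # Skip prefixes whose parent already identifies the region.
        if parent and len(regions_by_prefix[parent]) == 1:
            continue
        shortest[prefix] = next(iter(regions))
    return shortest


# Made-up sample rows; "OTHER" is a placeholder region label.
rows = [
    ("B13 8AA", "15E"),
    ("B14 5AA", "15E"),
    ("B14 7BB", "OTHER"),
]
print(unambiguous_prefixes(rows))
```

In this sample B13 is enough on its own, while B14 has to be followed one character further before the region is known.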

Note: There's one caveat. More than one regular expression can match the start of the same postcode, so for similar codes you have to test which expression matches the most characters. However, this works really quickly once the regex strings have been produced.
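A minimal sketch of that longest-match check, assuming a hypothetical `best_region` helper and two hand-written patterns (not generated output). Unlike the dictionary keyed on `m.end()` in the script, it keeps the first region seen when two matches have the same length:

```python
import re

# Hand-written illustrative patterns, not output from the script.
FORMULA = {
    "region_a": r"^B1",
    "region_b": r"^B14",
}


def best_region(postcode, formula):
    """Return the region whose pattern matches the most characters, else None."""
    compact = postcode.replace(" ", "")
    best, best_end = None, 0
    for region, pattern in formula.items():
        m = re.search(pattern, compact)
        if m and m.end() > best_end:
            best, best_end = region, m.end()
    return best


print(best_region("B14 5AA", FORMULA))
print(best_region("B13 8AA", FORMULA))
```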


hawkz Stephen Hawkes ~ 11 Apr 2022 8:30 a.m.

With the regular expressions built, I'm now able to validate postcodes from a region without needing a database.

https://codepen.io/hawkz/pen/qBpKzaL

6.6MB database turned into a few kilobytes!


hawkz Stephen Hawkes ~ 11 Apr 2022 8:40 a.m.

Bumps along the way...

I had a few issues with different attempts, a combination of trying to avoid functions for tree traversal, and relearning regular expressions.

One gotcha was in regular expressions, where parts of the postcode had more than one character. I was generating [B|CV], thinking that meant B or CV, but square brackets define a character class, so it matches a single character (B, |, C or V) rather than choosing between B and CV. The fix was a non-capturing group, (?:B|CV). I also experimented with different ways to process the CSV to the outputs, minimising or maximising the amount of data stored. I numbered each iteration, they're all here.
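The difference between a character class and a group is easy to check with Python's `re` module. The names `char_class` and `group` are just for this illustration:

```python
import re

char_class = re.compile(r"^[B|CV]")  # one character: B, |, C or V
group = re.compile(r"^(?:B|CV)")     # either "B" or the two characters "CV"

for text in ("B1", "CV1", "V1", "C1"):
    print(text, bool(char_class.search(text)), bool(group.search(text)))
```

The character class accepts V1 and C1, which the group correctly rejects.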

Most of this uses the standard library, so it should be a case of python run.py. A couple of them used tqdm, which needs pip install tqdm in your virtual environment to work.

#!/usr/bin/env python
import csv
import re

testfile = 'all.csv'
results = {}
size = 0
answers = []
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        answers.append(answer)

        for a in range(1,len(p)+1):
            if p[0:a] in results:
                if answer not in results[p[0:a]]:
                    results[p[0:a]].append(answer)
            else:
                results[p[0:a]] = [answer]

remove = []
for x in set(results):
    if x[:-1] in results:
        if len(results[x[:-1]]) == 1:
            remove.append(x)

for x in set(remove):
    del results[x]

# print('%s total results' % size)
# print('%s new total results' % len(results))

def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
    output = ''
    for num, t in enumerate(all): 
        if num: output += '|'
        if indent == 1: output += '^'
        output += t[-1]
        if indent <= 7:
            newindent = indent + 1
            children = [x for x in results if a in results[x] and len(x) == newindent and x.startswith(t)]
            if children:
                if len(children) > 1: output += '(?:'
                output += op(a, t, newindent)
                if len(children) > 1: output += ')'
            else:
                output += ''
    return output

formula = {}
for answer in set(answers):
    formula[answer] = op(answer, '', 1)
    print(answer, formula[answer])


# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in formula:
#             m = re.search(formula[test], postcode)
#             if m:
#                 results[m.end()] = test

#         print(postcode, answer, results)
#         if results.keys() and results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1

# print(total, win, lose, (win/total)*100)
#!/usr/bin/env python
import csv
import re
import json
import pprint
from collections import defaultdict

testfile = 'small2.csv'

def tree(tree):
    print(json.dumps(dict(tree), indent=3, sort_keys=True))

def merge(a, b, path=None):
    "merges b into a"
    if path is None: path = []
    for key in b:
        if key in a:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                merge(a[key], b[key], path + [str(key)])
            elif a[key] == b[key]:
                pass # same leaf value
            else:
                raise Exception('Conflict at %s' % '.'.join(path + [str(key)]))
        else:
            a[key] = b[key]
    return a

def count_keys(dict_, counter=0):
    for each_key in dict_:
        if isinstance(dict_[each_key], dict):
            # Recursive call
            counter = count_keys(dict_[each_key], counter + 1)
        else:
            counter += 1
    return counter

results = defaultdict(dict)
size = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        # Build a nested dict with one level per character, e.g. 'B1' -> {'B': {'1': {}}},
        # without passing file contents through eval().
        postdict = {}
        for char in reversed(p):
            postdict = {char: postdict}
        results[answer] = merge(results[answer], postdict)

# pprint.pprint(results)
print('%s rows in the sheet' % size)
print('%s keys in tree' % count_keys(results))

tree(results)

def recursive_get(d, keys):
    if len(keys) == 1:
        return d[keys[0]]
    return recursive_get(d[keys[0]], keys[1:])

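for answer in results:
    for first in results[answer]:
        # Is this first character used by any other answer?
        unique = True
        for other in [x for x in results if answer != x]:
            if first in results[other]:
                # not unique: another answer shares this first character
                unique = False
        # Only prune the subtree once every other answer has been checked.
        if unique:
            results[answer][first] = {}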


            # else:
            #     # not unique -> children
            #     for second in results[answer][first]:
            #         if not [y for y in otheranswers if second in results[y][first]]:
            #             # unique
            #             results[answer][first][second] = {}
            #         else:
            #             # not unique -> children
            #             for third in  results[answer][first][second]:
            #                 if not [y for y in otheranswers if third in results[y][first][second]]:
            #                     # unique
            #                     results[answer][first][second][third] = {}
            #                 else:
            #                     # not unique -> children
            #                     for fourth in results[answer][first][second][third]:
            #                         if not [y for y in otheranswers if fourth in results[y][first][second][third]]:
            #                             # unique
            #                             results[answer][first][second][third][fourth] = {}
            #                         else:
            #                             # not unique -> children
            #                             for fifth in results[answer][first][second][third][fourth]:
            #                                 if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth]]:
            #                                     # unique
            #                                     results[answer][first][second][third][fourth][fifth] = {}
            #                                 else:
            #                                     # not unique -> children
            #                                     for sixth in results[answer][first][second][third][fourth][fifth]:
            #                                         if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth][fifth]]:
            #                                             # unique
            #                                             results[answer][first][second][third][fourth][fifth][sixth] = {}


print('%s keys in tree' % count_keys(results))
tree(results)

# ##########

# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m: 
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#             # print(postcode, answer, results[sorted(results.keys())[-1]])

# print(total, win, lose, (win/total)*100)

# def walk_dict(d):
#     text = ''
#     for k,v in d.items():
#         if isinstance(v, dict):
#             text += k + "("
#             text += walk_dict(v)
#             text += ')'
#         else:
#             text += "%s %s" % (k, v)
#     return text

# for answer in results:
#     print (answer + ':  ', walk_dict(results[answer]))
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint


testfile = 'small2.csv'

results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]

        for num, l in enumerate(postcode):
            if num in results[answer]:
                 results[answer][num].append(postcode[0:num+1])
            else:
                results[answer][num] = [postcode[0:num+1]]

original = 0
for answer in results:
    for x in results[answer]:
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])

        for part in results[answer][x]:
            unique = True
            for anotheranswer in results:
                if answer == anotheranswer:
                    continue
                if part in results[anotheranswer].get(x, []):
                    unique = False
                    break
            if unique:
                for z in range(x+1, len(results[answer])):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))

print("%s combinations" % original)

print(results)


# items = 0
# for answer in results:
#     for x in range(4):
#         items += len(results[answer][x])

# print("reduced to %s combinations, %s percent reduction" % (items, items/original ))

def recursechildren(answer, counter, part):
    if counter in results[answer]:
        for num, id in enumerate(sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter]))))):
            print(results[answer][counter][num].replace(part, ''), end = '')
            if counter+1 in results[answer]:
                print('(', end = '')
                for childpart in sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter+1])))):
                    recursechildren(answer, counter+1, childpart)
                print(')', end = '')
            print('|', end = '')


out = {}
for answer in results:
    print(answer, '>>>')
    for num, id in enumerate(results[answer]):
        recursechildren(answer, num, part)


# for r in out:
#     print (r, '>>', out[r])

# ##########

# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1

#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m: 
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#             # print(postcode, answer, results[sorted(results.keys())[-1]])

# print(total, win, lose, (win/total)*100)
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint


testfile = 'all.csv'

results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0]
        answer = row[1]
        pc_parts = re.split(r'(\d+)', postcode)
        pc_parts.remove(' ')
        if answer not in results:
            results[answer] = {0: [], 1: [], 2: [], 3: []}
        for a in results[answer].keys():
            results[answer][a].append(''.join(pc_parts[0:a+1]))

original = 0
for answer in results:
    for x in range(4):
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])

print("%s combinations" % original)

for y in range(3):
    for answer in results:
        for part in results[answer][y]:
            unique = True
            for anotheranswer in results:
                if answer != anotheranswer and part in results[anotheranswer][y]:
                    unique = False
                    break
            if unique:
                for z in range(y+1, 4):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))


items = 0
for answer in results:
    for x in range(4):
        items += len(results[answer][x])

print("reduced to %s combinations, %s percent reduction" % (items, (1 - items / original) * 100))

out = {}
for answer in results:
    out[answer] = ''
    for number, part in enumerate(results[answer][0]):
        if number: out[answer] += '|'
        out[answer] += '^' + part

        if len(results[answer][1]) > 0:
            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += '(?:'

            for number1, part1 in enumerate([x for x in results[answer][1] if x.startswith(part)]):
                if number1: out[answer] += '|'
                out[answer] += part1.replace(part, '')

                if len([x for x in results[answer][2] if x.startswith(part1)]) > 0:
                    out[answer] += "(?:"
                    for number2, part2 in enumerate([x for x in results[answer][2] if x.startswith(part1)]):
                        if number2: out[answer] += '|'
                        out[answer] += part2.replace(part1, '')


                        if len([x for x in results[answer][3] if x.startswith(part2)]):
                            endings = '|'.join([x[-2:] for x in results[answer][3] if x.startswith(part2)])
                            if len(endings):
                                if number2: out[answer] += '|'
                                out[answer] += '(?:' + endings + ')'

                    out[answer] += ")"

            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += ')'


for r in out:
    print (r, '>>', out[r])

##########

win = 0
lose = 0
total = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        total += 1

        results = {}
        for test in out:
            m = re.search('^('+out[test]+")", postcode)
            if m: 
                end = m.end()
                results[end] = test
        if results and results[max(results)] == answer:
            win +=1
        else:
            lose +=1
            # print(postcode, answer, results[sorted(results.keys())[-1]])

print(total, win, lose, (win/total)*100)
#!/usr/bin/env python
import csv

# results = {}
# with open('all.csv', newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         if row[1] in results:
#             if row[0].split(" ")[0] not in results[row[1]]:
#                 results[row[1]].append(row[0].split(" ")[0])
#         else:
#             results[row[1]] = [row[0].split(" ")[0]]

# incodes = []
# dupes = []
# for outcode in results:
#     for incode in results[outcode]:
#         if incode not in incodes:
#             incodes.append(incode)
#         for othercode in results:
#             if othercode != outcode:
#                 if incode in results[othercode]:
#                     if incode not in dupes:
#                         dupes.append(incode)

# print("%s outcodes are in more than one region" % len(dupes))
# print("%s are unique to regions" % (len(incodes) - len(dupes)))
# print("%s all the outcodes" % len(incodes))

# print(results.keys())


results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] + row[0].split(" ")[1][0] not in results[row[1]]:
                results[row[1]][row[0].split(" ")[0]] = [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
            else:
                results[row[1]][row[0].split(" ")[0]].append(row[0].split(" ")[0] + row[0].split(" ")[1][0])
        else:
            results[row[1]] = {
                row[0].split(" ")[0]: [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
            }
# print(results)

# Find the unique outcodes
for region in results:
    results[region]['values'] = []
    for outcode in results[region]:
        unique = True
        for otherregion in results:
            if outcode in results[otherregion] and region != otherregion:
                unique = False
        if unique:
            results[region]['values'].append(outcode)

# remove uniques we've found
for region in results:
    for outcode in results[region]['values']:
        results[region].pop(outcode, None)

# find the uniques at incode level
for region in results:
    for outcode in list(results[region]):
        unique = True
        if outcode != 'values':
            for otherregion in results:
                for otheroutcode in results[otherregion]:
                    if outcode in results[otherregion][otheroutcode]:
                        unique = False
        if unique:
            if 'values' in results[region]:
                results[region]['values'].append(results[region][outcode][0])
                # results[region].pop(outcode, None)
            else: 
                results[region]['values'] = [results[region][outcode][0]]
                # results[region].pop(outcode, None)


shortlist = {}
for region in results:
    # Group the alternatives so that '^' anchors every option, not just the first.
    shortlist[region] = '^(?:' + '|'.join(set(results[region]['values'])) + ')'
print(shortlist)


import re

# verify
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    win, lose = 0, 0
    for row in reader:
        pcode =  row[0].replace(' ','')
        result = ''
        for test in shortlist:
            val = re.search(shortlist[test], pcode)

            if val:
                # print(val, test, shortlist[test],  pcode)
                result = test
                break

        # print(pcode, row[1], result, row[1] == result)
        if row[1] == result:
            win += 1
        else: 
            lose += 1
            print(pcode)

# print(win, lose)

ntoll Nicholas Tollervey ~ 27 Nov 2022 9:05 p.m.

Hi Steve, 👋

I'm going to be your mentor for this project, and thank you so much for your work done so far.

I have some initial feedback, and I hope you take the time to update things, push back or ask questions as we refine your project into a state that allows me to give you a final mark:

# Extract and map CCG ids to postcodes contained therein.
for row in reader:
    if row[1] in results:
        if row[0].split(" ")[0] not in results[row[1]]:
            results[row[1]].append(row[0].split(" ")[0])
    else:
        results[row[1]] = [row[0].split(" ")[0]]

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # Extract and map CCG ids to postcodes contained therein.
    for postcode, ccg in reader:
        outcode, incode = postcode.split(" ")
        if ccg in results:
            if outcode not in results[ccg]:
                results[ccg].append(outcode)
        else:
            results[ccg] = [outcode, ]

def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]

brum_postcode.png

So, please don't be disheartened by the feedback. Much of it can be rectified by remembering to add comments describing intention, naming things appropriately so the code reads in a meaningful manner, and structuring your Markdown in a way that the story unfolds. All of which can be easily updated with a few edits. 🚀

I'm also getting an impression of what the final project might look like for an end user. A sort of fake government service where you enter your postcode and the service tells you the details of your CCG (contact details, reports, website etc...). This would be a great project to get hosted online somewhere. Perhaps, once the comments, names and Markdown are fixed, this is something to refine towards... perhaps a super-simple Flask app freely hosted on PythonAnywhere, or similar..?
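As a rough illustration of what the core of such a service might look like, here is a minimal sketch of the lookup step on its own, using only the standard library. It assumes the project has already produced one regular expression per region. The region labels, the patterns and the name `lookup_region` are hypothetical placeholders, not taken from the submitted code. It picks the region whose pattern matches the longest prefix, which mirrors the "longest match wins" check in the verification loop of the submitted scripts.

```python
import re

# Hypothetical region labels and patterns, for illustration only.
# Real patterns would come from the generated output, not from here.
EXAMPLE_PATTERNS = {
    "REGION_A": r"B1(?:1|2)",
    "REGION_B": r"B1(?:3|4)|B2",
}


def lookup_region(postcode, patterns):
    """Return the region whose pattern matches the longest prefix, or None."""
    # Normalise user input: drop spaces and ignore case.
    normalised = postcode.replace(" ", "").upper()
    best_region, best_length = None, 0
    for region, pattern in patterns.items():
        # re.match anchors at the start of the string; the non-capturing
        # group keeps every alternative inside that anchor.
        match = re.match("(?:" + pattern + ")", normalised)
        if match and match.end() > best_length:
            best_region, best_length = region, match.end()
    return best_region
```

A web front end (for example a single Flask view, as suggested above) could then call `lookup_region` with the submitted postcode and render the details for whichever region comes back, or a "not found" message when the result is `None`.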

Once I have some source code for an app like this I can check you're covering all the bases for the core concepts you need for grade three, give you feedback, and finally give you your result.

Great work so far Steve, keep up the good work and please remember to refine the things you've done already. As always, if you have any questions or clarifications, please don't hesitate to reply here. I'm here to help and support you as this project matures into something epic. 🤗

