Candidate: Stephen Hawkes Assessed by: Nicholas Tollervey
Python (2023) ~ Grade 3 (Lower)
Regular Postcodes
Postcodes are great: they're a short way to figure out where someone is, and they're useful in lots of contexts. Often someone needs to work out whether their postcode falls into one region or another. E.g. I know my postcode but not my NHS CCG area, or I know my postcode but not my electoral ward. This is usually solved with a database table that stores every postcode alongside the associated values. Brilliant. Except it sucks for small use cases or for static web projects. For this grade I was curious to see whether I could convert the list of postcodes into a regular expression (or something similar), so that a static process could determine the region without a database lookup.
Attached files
Filename (click to download) | Size | Uploaded by |
---|---|---|
pcodey60.zip | 1.0 MB | hawkz |
Markdown code
[pcodey60.zip](/media/assessment/86a9f52f/2022-03-31/09-19-49/pcodey60.zip){target="_blank"}
run.py | 709 bytes | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]
print(results.keys())
print(results)

# outward = {}
# with open('all.csv') as f:
#     for line in f:
#         code = line.strip().split(" ")
#         if code[0] not in outward:
#             outward[code[0]] = [code[1]]
#         else:
#             outward[code[0]].append(code[1])
# print(outward.keys())
```
all.csv | 6.2 MB | hawkz |
Markdown code
[all.csv](/media/assessment/86a9f52f/2022-03-31/19-58-11/all.csv){target="_blank"}
all.csv | 6.2 MB | hawkz |
Markdown code
[all.csv](/media/assessment/86a9f52f/2022-04-11/08-00-18/all.csv){target="_blank"}
run5.py | 2.3 KB | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv
import re

testfile = 'all.csv'
results = {}
size = 0
answers = []
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        answers.append(answer)
        for a in range(1,len(p)+1):
            if p[0:a] in results:
                if answer not in results[p[0:a]]:
                    results[p[0:a]].append(answer)
            else:
                results[p[0:a]] = [answer]

remove = []
for x in set(results):
    if x[:-1] in results:
        if len(results[x[:-1]]) == 1:
            remove.append(x)
for x in set(remove):
    del results[x]

# print('%s total results' % size)
# print('%s new total results' % len(results))

def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
    output = ''
    for num, t in enumerate(all):
        if num: output += '|'
        if indent == 1: output += '^'
        output += t[-1]
        if indent <= 7:
            newindent = indent + 1
            children = [x for x in results if a in results[x] and len(x) == newindent and x.startswith(t)]
            if children:
                if len(children) > 1: output += '(?:'
                output += op(a, t, newindent)
                if len(children) > 1: output += ')'
            else:
                output += ''
    return output

formula = {}
for answer in set(answers):
    formula[answer] = op(answer, '', 1)
    print(answer, formula[answer])

# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1
#         results = {}
#         for test in formula:
#             m = re.search(formula[test], postcode)
#             if m:
#                 results[m.end()] = test
#         print(postcode, answer, results)
#         if results.keys() and results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
# print(total, win, lose, (win/total)*100)
```
run4.py | 5.2 KB | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
import json
import pprint
from collections import defaultdict

testfile = 'small2.csv'

def tree(tree):
    print(json.dumps(dict(tree), indent=3, sort_keys=True))

def merge(a, b, path=None):
    "merges b into a"
    if path is None: path = []
    for key in b:
        if key in a:
            if isinstance(a[key], dict) and isinstance(b[key], dict):
                merge(a[key], b[key], path + [str(key)])
            elif a[key] == b[key]:
                pass # same leaf value
            else:
                raise Exception('Conflict at %s' % '.'.join(path + [str(key)]))
        else:
            a[key] = b[key]
    return a

def count_keys(dict_, counter=0):
    for each_key in dict_:
        if isinstance(dict_[each_key], dict):
            # Recursive call
            counter = count_keys(dict_[each_key], counter + 1)
        else:
            counter += 1
    return counter

results = defaultdict(dict)
size = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        postdict = eval("{'" + "': {'".join(['%s' % x for x in p]) + "':{}" +(len(['{%s:' % x for x in p]) * "}"))
        results[answer] = merge(results[answer], postdict)

# pprint.pprint(results)
print('%s rows in the sheet' % size)
print('%s keys in tree' % count_keys(results))
tree(results)

def recursive_get(d, keys):
    if len(keys) == 1:
        return d[keys[0]]
    return recursive_get(d[keys[0]], keys[1:])

for answer in results:
    for first in results[answer]:
        # C
        unique = True
        for other in [x for x in results if answer != x]:
            if [x for x in results[other] if x == first]:
                # unique
                unique = False
        if unique:
            results[answer][first] = {}
        # else:
        #     # not unique -> children
        #     for second in results[answer][first]:
        #         if not [y for y in otheranswers if second in results[y][first]]:
        #             # unique
        #             results[answer][first][second] = {}
        #         else:
        #             # not unique -> children
        #             for third in results[answer][first][second]:
        #                 if not [y for y in otheranswers if third in results[y][first][second]]:
        #                     # unique
        #                     results[answer][first][second][third] = {}
        #                 else:
        #                     # not unique -> children
        #                     for fourth in results[answer][first][second][third]:
        #                         if not [y for y in otheranswers if fourth in results[y][first][second][third]]:
        #                             # unique
        #                             results[answer][first][second][third][fourth] = {}
        #                         else:
        #                             # not unique -> children
        #                             for fifth in results[answer][first][second][third][fourth]:
        #                                 if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth]]:
        #                                     # unique
        #                                     results[answer][first][second][third][fourth][fifth] = {}
        #                                 else:
        #                                     # not unique -> children
        #                                     for sixth in results[answer][first][second][third][fourth][fifth]:
        #                                         if not [y for y in otheranswers if fifth in results[y][first][second][third][fourth][fifth]]:
        #                                             # unique
        #                                             results[answer][first][second][third][fourth][fifth][sixth] = {}

print('%s keys in tree' % count_keys(results))
tree(results)

# ##########
# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1
#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m:
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#         # print(postcode, answer, results[sorted(results.keys())[-1]])
# print(total, win, lose, (win/total)*100)

# def walk_dict(d):
#     text = ''
#     for k,v in d.items():
#         if isinstance(v, dict):
#             text += k + "("
#             text += walk_dict(v)
#             text += ')'
#         else:
#             text += "%s %s" % (k, v)
#     return text
# for answer in results:
#     print (answer + ': ', walk_dict(results[answer]))
```
run3.py | 3.1 KB | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint

testfile = 'small2.csv'
results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        for num, l in enumerate(postcode):
            if num in results[answer]:
                results[answer][num].append(postcode[0:num+1])
            else:
                results[answer][num] = [postcode[0:num+1]]

original = 0
for answer in results:
    for x in results[answer]:
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])
        for part in results[answer][x]:
            unique = True
            for anotheranswer in results:
                if answer != anotheranswer:
                    if x in results[anotheranswer]:
                        if part in results[anotheranswer][x]:
                            unique = False
                            continue
                        else:
                            continue
            if unique:
                for z in range(x+1, len(results[answer])):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))
print("%s combinations" % original)
print(results)

# items = 0
# for answer in results:
#     for x in range(4):
#         items += len(results[answer][x])
# print("reduced to %s combinations, %s percent reduction" % (items, items/original ))

def recursechildren(answer, counter, part):
    if counter in results[answer]:
        for num, id in enumerate(sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter]))))):
            print(results[answer][counter][num].replace(part, ''), end = '')
            if counter+1 in results[answer]:
                print('(', end = '')
                for childpart in sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][counter+1])))):
                    recursechildren(answer, counter+1, childpart)
                print(')', end = '')
            print('|', end = '')

out = {}
for answer in results:
    print(answer, '>>>')
    for num, id in enumerate(results[answer]):
        recursechildren(answer, num, part)

# for r in out:
#     print (r, '>>', out[r])

# ##########
# win = 0
# lose = 0
# total = 0
# with open(testfile, newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         postcode = row[0].replace(' ','')
#         answer = row[1]
#         total += 1
#         results = {}
#         for test in out:
#             m = re.search('^('+out[test]+")", postcode)
#             if m:
#                 end = m.end()
#                 results[end] = test
#         if results[sorted(results.keys())[-1]] == answer:
#             win +=1
#         else:
#             lose +=1
#         # print(postcode, answer, results[sorted(results.keys())[-1]])
# print(total, win, lose, (win/total)*100)
```
run2.py | 3.6 KB | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv
import re
from collections import defaultdict
import pprint

testfile = 'all.csv'
results = defaultdict(dict)
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0]
        answer = row[1]
        pc_parts = re.split('(\d+)',postcode)
        pc_parts.remove(' ')
        if answer not in results:
            results[answer] = {0: [], 1: [], 2: [], 3: []}
        for a in results[answer].keys():
            results[answer][a].append(''.join(pc_parts[0:a+1]))

original = 0
for answer in results:
    for x in range(4):
        results[answer][x] = list(set(results[answer][x]))
        original += len(results[answer][x])
print("%s combinations" % original)

for y in range(3):
    for answer in results:
        for part in results[answer][y]:
            unique = True
            for anotheranswer in results:
                if answer != anotheranswer:
                    if part in results[anotheranswer][y]:
                        unique = False
                        continue
                    else:
                        continue
            if unique:
                for z in range(y+1, 4):
                    results[answer][z] = sorted(list(filter(lambda x: not x.startswith(part), set(results[answer][z]))))

items = 0
for answer in results:
    for x in range(4):
        items += len(results[answer][x])
print("reduced to %s combinations, %s percent reduction" % (items, items/original ))

out = {}
for answer in results:
    out[answer] = ''
    for number, part in enumerate(results[answer][0]):
        if number: out[answer] += '|'
        out[answer] += '^' + part
        if len(results[answer][1]) > 0:
            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += '(?:'
            for number1, part1 in enumerate([x for x in results[answer][1] if x.startswith(part)]):
                if number1: out[answer] += '|'
                out[answer] += part1.replace(part, '')
                if len([x for x in results[answer][2] if x.startswith(part1)]) > 0:
                    out[answer] += "(?:"
                    for number2, part2 in enumerate([x for x in results[answer][2] if x.startswith(part1)]):
                        if number2: out[answer] += '|'
                        out[answer] += part2.replace(part1, '')
                        if len([x for x in results[answer][3] if x.startswith(part2)]):
                            endings = '|'.join([x[-2:] for x in results[answer][3] if x.startswith(part2)])
                            if len(endings):
                                if number2: out[answer] += '|'
                                out[answer] += '(?:' + endings + ')'
                    out[answer] += ")"
            if len([x for x in results[answer][1] if x.startswith(part)]) > 1: out[answer] += ')'
for r in out:
    print (r, '>>', out[r])

##########
win = 0
lose = 0
total = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        total += 1
        results = {}
        for test in out:
            m = re.search('^('+out[test]+")", postcode)
            if m:
                end = m.end()
                results[end] = test
        if results[sorted(results.keys())[-1]] == answer:
            win +=1
        else:
            lose +=1
        # print(postcode, answer, results[sorted(results.keys())[-1]])
print(total, win, lose, (win/total)*100)
```
run.py | 3.6 KB | hawkz |
Markdown code
```Python
#!/usr/bin/env python
import csv

# results = {}
# with open('all.csv', newline='') as csvfile:
#     reader = csv.reader(csvfile, delimiter=',', quotechar='"')
#     for row in reader:
#         if row[1] in results:
#             if row[0].split(" ")[0] not in results[row[1]]:
#                 results[row[1]].append(row[0].split(" ")[0])
#         else:
#             results[row[1]] = [row[0].split(" ")[0]]
# incodes = []
# dupes = []
# for outcode in results:
#     for incode in results[outcode]:
#         if incode not in incodes:
#             incodes.append(incode)
#         for othercode in results:
#             if othercode != outcode:
#                 if incode in results[othercode]:
#                     if incode not in dupes:
#                         dupes.append(incode)
# print("%s outcodes are in more than one region" % len(dupes))
# print("%s are unique to regions" % (len(incodes) - len(dupes)))
# print("%s all the outcodes" % len(incodes))
# print(results.keys())

results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] + row[0].split(" ")[1][0] not in results[row[1]]:
                results[row[1]][row[0].split(" ")[0]] = [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
            else:
                results[row[1]][row[0].split(" ")[0]].append(row[0].split(" ")[0] + row[0].split(" ")[1][0])
        else:
            results[row[1]] = {
                row[0].split(" ")[0]: [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
            }
# print(results)

# Find the unique outcodes
for region in results:
    results[region]['values'] = []
    for outcode in results[region]:
        unique = True
        for otherregion in results:
            if outcode in results[otherregion] and region != otherregion:
                unique = False
        if unique:
            results[region]['values'].append(outcode)

# remove uniques we've found
for region in results:
    for outcode in results[region]['values']:
        results[region].pop(outcode, None)

# find the uniques at incode level
for region in results:
    for outcode in list(results[region]):
        unique = True
        if outcode != 'values':
            for otherregion in results:
                for otheroutcode in results[otherregion]:
                    if outcode in results[otherregion][otheroutcode]:
                        unique = False
            if unique:
                if 'values' in results[region]:
                    results[region]['values'].append(results[region][outcode][0])
                    # results[region].pop(outcode, None)
                else:
                    results[region]['values'] = [results[region][outcode][0]]
                    # results[region].pop(outcode, None)

shortlist = {}
for region in results:
    shortlist[region] = '^' + '|'.join(set(results[region]['values']))
print(shortlist)

import re
# verify
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    win, lose = 0, 0
    for row in reader:
        pcode = row[0].replace(' ','')
        result = ''
        for test in shortlist:
            val = re.search(shortlist[test], pcode)
            if val:
                # print(val, test, shortlist[test], pcode)
                result = test
                break
        # print(pcode, row[1], result, row[1] == result)
        if row[1] == result:
            win += 1
        else:
            lose += 1
            print(pcode)
# print(win, lose)
```
brum_postcode.png | 58.8 KB | ntoll |
Markdown code
![brum_postcode.png](/media/assessment/86a9f52f/2022-11-27/20-40-22/brum_postcode.png "brum_postcode.png")
Status: Submitted for assessment.
Stephen Hawkes ~ 31 Mar 2022 9:21 a.m.
pcodey60.zip I have the NHS database for CCG areas. I wonder if I could change this from a huge CSV to a smaller way to look up the data, perhaps a Python dictionary.
E.g. {"^B(.*)": "15E"}
This would mean that any postcode starting with B gives the 15E (Birmingham and Solihull CCG) region ID.
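A minimal sketch of that idea, assuming a hypothetical `REGION_PATTERNS` dictionary and `lookup_region` helper (neither appears in the attached scripts). The single `^B(.*)` rule is the illustrative one from above rather than real data, and the postcodes are only examples:

```python
import re

# Hypothetical mapping of regular expression -> CCG region ID.
# "^B(.*)" is the illustrative rule from the comment, not real data.
REGION_PATTERNS = {
    r"^B(.*)": "15E",
}


def lookup_region(postcode, patterns=REGION_PATTERNS):
    """Return the region ID of the first pattern that matches, or None."""
    compact = postcode.replace(" ", "").upper()
    for pattern, region in patterns.items():
        if re.search(pattern, compact):
            return region
    return None


print(lookup_region("B13 8AB"))
print(lookup_region("PO1 3AX"))
```

The first call finds a region and the second returns `None`, because no pattern in this toy dictionary covers PO postcodes.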
Wish me luck! Or enjoy the madness that follows.
Stephen Hawkes ~ 31 Mar 2022 7:25 p.m.
Each postcode consists of two parts. The first part is the outward postcode, or outcode. This is separated by a single space from the second part, which is the inward postcode, or in-code.
The outward postcode enables mail to be sent to the correct local area for delivery. This part of the code contains the area and the district to which the mail is to be delivered.
The inward postcode is used to sort the mail at the local area delivery office. It consists of a numeric character followed by two alphabetic characters. The numeric character identifies the sector within the postal district. The alphabetic characters then define one or more properties within the sector.
For example, take PO1 3AX:
- PO refers to the postcode area of Portsmouth.
- PO1 refers to a postcode district within the postcode area of Portsmouth.
- PO1 3 refers to the postcode sector.
- PO1 3AX: the AX completes the postcode. The last two letters define the ‘unit postcode’, which identifies one or more small user delivery points or an individual large user.
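The breakdown above can be sketched in a few lines of Python. `split_postcode` is a hypothetical helper, not part of the attached scripts, and it assumes a well-formed postcode with a single space between the outcode and the incode:

```python
def split_postcode(postcode):
    """Split a UK postcode such as 'PO1 3AX' into its named parts.

    Assumes the input is well formed: an outcode, one space, then a
    three-character incode (one digit followed by two letters).
    """
    outcode, incode = postcode.strip().upper().split(" ")
    # The area is the leading run of letters in the outcode.
    area = ""
    for ch in outcode:
        if not ch.isalpha():
            break
        area += ch
    return {
        "area": area,                         # e.g. PO
        "district": outcode,                  # e.g. PO1
        "sector": f"{outcode} {incode[0]}",   # e.g. PO1 3
        "unit": f"{outcode} {incode}",        # e.g. PO1 3AX
    }


print(split_postcode("PO1 3AX"))
```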
Stephen Hawkes ~ 31 Mar 2022 8:04 p.m.
Here's a quick demo that I built to try my theory out. It parses the CSV file containing all the UK postcodes and their NHS CCG regions, then outputs the keys and the whole dict so I can see what reduction there is, if any.
#!/usr/bin/env python
import csv
results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]
print(results.keys())
print(results)
So, as a quick test, we can make a big saving from just the outward part of the postcode for the CCG regions. B13 only appears in the 15E region, so all B13 postcodes are in Birmingham and Solihull CCG. However, B14 appears in more than one region, so next I need to measure how unique the outcodes are. If every outcode had been unique to a region, I could have replaced a 6 MB CSV with a 5 KB text file.
Stephen Hawkes ~ 31 Mar 2022 8:58 p.m. (updated: 31 Mar 2022 9:27 p.m.)
#!/usr/bin/env python
import csv
results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        if row[1] in results:
            if row[0].split(" ")[0] not in results[row[1]]:
                results[row[1]].append(row[0].split(" ")[0])
        else:
            results[row[1]] = [row[0].split(" ")[0]]
incodes = []
dupes = []
for outcode in results:
    for incode in results[outcode]:
        if incode not in incodes:
            incodes.append(incode)
        for othercode in results:
            if othercode != outcode:
                if incode in results[othercode]:
                    if incode not in dupes:
                        dupes.append(incode)
print("%s outcodes are in more than one region" % len(dupes))
print("%s are unique to regions" % (len(incodes) - len(dupes)))
print("%s all the outcodes" % len(incodes))
Here I've found the following:
150 outcodes are in more than one region
272 are unique to regions
422 all the outcodes
So for about 64% of outcodes (272 of 422), the region can be found from the outcode alone.
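The same count can be sketched with a dictionary of sets, which avoids the nested list scans. The rows below are made-up stand-ins for all.csv, and the region IDs other than 15E are placeholders chosen for illustration, not taken from the data:

```python
from collections import defaultdict

# Made-up (postcode, region) rows standing in for all.csv.
rows = [
    ("B13 8AB", "15E"),
    ("B13 9XY", "15E"),
    ("B14 4AA", "15E"),
    ("B14 7ZZ", "05C"),
    ("PO1 3AX", "10R"),
]

# Collect the set of regions seen for each outcode.
regions_by_outcode = defaultdict(set)
for postcode, region in rows:
    outcode = postcode.split(" ")[0]
    regions_by_outcode[outcode].add(region)

unique = sorted(o for o, r in regions_by_outcode.items() if len(r) == 1)
shared = sorted(o for o, r in regions_by_outcode.items() if len(r) > 1)

print("unique to one region:", unique)
print("in more than one region:", shared)

# The proportion reported above: 272 unique outcodes out of 422.
print(f"{272 / 422:.0%}")
```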
Stephen Hawkes ~ 11 Apr 2022 8:02 a.m.
After exploring further I was able to create a working algorithm to compress the postcodes into a regular expression:
#!/usr/bin/env python
import csv
import re
testfile = 'all.csv'
results = {}
size = 0
answers = []
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        size += 1
        p = row[0].replace(' ','')
        answer = row[1]
        answers.append(answer)
        for a in range(1,len(p)+1):
            if p[0:a] in results:
                if answer not in results[p[0:a]]:
                    results[p[0:a]].append(answer)
            else:
                results[p[0:a]] = [answer]
remove = []
for x in set(results):
    if x[:-1] in results:
        if len(results[x[:-1]]) == 1:
            remove.append(x)
for x in set(remove):
    del results[x]
def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
    output = ''
    for num, t in enumerate(all):
        if num: output += '|'
        if indent == 1: output += '^'
        output += t[-1]
        if indent <= 7:
            newindent = indent + 1
            children = [x for x in results if a in results[x] and len(x) == newindent and x.startswith(t)]
            if children:
                if len(children) > 1: output += '(?:'
                output += op(a, t, newindent)
                if len(children) > 1: output += ')'
    return output
formula = {}
for answer in set(answers):
    formula[answer] = op(answer, '', 1)
    print(answer, formula[answer])
win = 0
lose = 0
total = 0
with open(testfile, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        postcode = row[0].replace(' ','')
        answer = row[1]
        total += 1
        results = {}
        for test in formula:
            m = re.search(formula[test], postcode)
            if m:
                results[m.end()] = test
        if results.keys() and results[sorted(results.keys())[-1]] == answer:
            win +=1
        else:
            lose +=1
print(total, win, lose, (win/total)*100)
With the process above I can get a 6.6 MB file to compress to 67 KB of regular expressions, with 100% of the codes being tested. The input file is all.csv.
Note: there's one caveat. More than one region's regular expression can match the start of the same postcode, so you have to test which regular expression matches the most characters of the postcode in the case of similar codes. However, this works really quickly once the regex strings have been produced.
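A small sketch of that longest-match rule, assuming a hypothetical `best_region` helper and two hand-written expressions that are far smaller than the generated ones. The region IDs and postcodes are illustrative only:

```python
import re

# Hypothetical per-region expressions, much smaller than the real output.
formula = {
    "15E": r"^B1(?:3|4(?:4|5))",
    "05C": r"^B14",
}


def best_region(postcode, formula):
    """Return the region whose expression matches the most characters."""
    compact = postcode.replace(" ", "")
    best, best_end = None, 0
    for region, pattern in formula.items():
        m = re.match(pattern, compact)
        # Keep the match that consumes the longest prefix of the postcode.
        if m and m.end() > best_end:
            best, best_end = region, m.end()
    return best


for code in ("B14 4AA", "B14 7ZZ", "PO1 3AX"):
    print(code, best_region(code, formula))
```

Here both expressions match the start of B14 4AA, and the longer match decides the region. A tie between two expressions of equal match length would need its own rule, which this sketch does not address.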
Stephen Hawkes ~ 11 Apr 2022 8:30 a.m.
With the regular expressions built, I'm now able to validate postcodes from a region without needing a database.
https://codepen.io/hawkz/pen/qBpKzaL
6.6MB database turned into a few kilobytes!
Stephen Hawkes ~ 11 Apr 2022 8:40 a.m.
Bumps along the way...
I had a few issues with different attempts, a combination of trying to avoid functions for tree traversal, and relearning regular expressions.
One gotcha was in regular expressions, where parts of a postcode had more than one character. I was generating `[B|CV]`, thinking that meant B or CV, but square brackets define a character class, so it matches a single character (B, |, C or V) rather than the string CV. Alternation needs a group such as `(?:B|CV)`. I also experimented with different ways to process the CSV into the outputs, minimising or maximising the amount of data stored. I numbered each iteration; they're all here.
Most of this uses the standard library, so it should be a case of `python run.py`. A couple of them used tqdm, which needs `pip install tqdm` in your virtual environment to work.
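The difference between `[B|CV]` and a grouped alternation is easy to check directly. This is a small illustrative test, not taken from the attached scripts, and the trailing `1` is only there to give the patterns something to anchor against:

```python
import re

# Square brackets: exactly one character out of B, |, C, V.
char_class = re.compile(r"^[B|CV]1")
# Non-capturing group: the string "B" or the string "CV".
alternation = re.compile(r"^(?:B|CV)1")

for code in ("B1", "CV1", "V1", "C1"):
    print(code, bool(char_class.match(code)), bool(alternation.match(code)))
```

The character class rejects CV1 and accepts V1 and C1, which is the opposite of what was wanted; the group accepts B1 and CV1 only.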
#!/usr/bin/env python
import csv
# results = {}
# with open('all.csv', newline='') as csvfile:
# reader = csv.reader(csvfile, delimiter=',', quotechar='"')
# for row in reader:
# if row[1] in results:
# if row[0].split(" ")[0] not in results[row[1]]:
# results[row[1]].append(row[0].split(" ")[0])
# else:
# results[row[1]] = [row[0].split(" ")[0]]
# incodes = []
# dupes = []
# for outcode in results:
# for incode in results[outcode]:
# if incode not in incodes:
# incodes.append(incode)
# for othercode in results:
# if othercode != outcode:
# if incode in results[othercode]:
# if incode not in dupes:
# dupes.append(incode)
# print("%s outcodes are in more than one region" % len(dupes))
# print("%s are unique to regions" % (len(incodes) - len(dupes)))
# print("%s all the outcodes" % len(incodes))
# print(results.keys())
results = {}
with open('all.csv', newline='') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
for row in reader:
if row[1] in results:
if row[0].split(" ")[0] + row[0].split(" ")[1][0] not in results[row[1]]:
results[row[1]][row[0].split(" ")[0]] = [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
else:
results[row[1]][row[0].split(" ")[0]].append(row[0].split(" ")[0] + row[0].split(" ")[1][0])
else:
results[row[1]] = {
row[0].split(" ")[0]: [row[0].split(" ")[0] + row[0].split(" ")[1][0]]
}
# print(results)
# Find the unique outcodes
for region in results:
results[region]['values'] = []
for outcode in results[region]:
unique = True
for otherregion in results:
if outcode in results[otherregion] and region != otherregion:
unique = False
if unique:
results[region]['values'].append(outcode)
# remove uniques we've found
for region in results:
for outcode in results[region]['values']:
results[region].pop(outcode, None)
# find the uniques at incode level
for region in results:
for outcode in list(results[region]):
unique = True
if outcode != 'values':
for otherregion in results:
for otheroutcode in results[otherregion]:
if outcode in results[otherregion][otheroutcode]:
unique = False
if unique:
if 'values' in results[region]:
results[region]['values'].append(results[region][outcode][0])
# results[region].pop(outcode, None)
else:
results[region]['values'] = [results[region][outcode][0]]
# results[region].pop(outcode, None)
shortlist = {}
for region in results:
# Group the alternatives so that '^' anchors every option, not just the first one.
shortlist[region] = '^(?:' + '|'.join(set(results[region]['values'])) + ')'
print(shortlist)
import re
# verify
with open('all.csv', newline='') as csvfile:
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
win, lose = 0, 0
for row in reader:
pcode = row[0].replace(' ','')
result = ''
for test in shortlist:
val = re.search(shortlist[test], pcode)
if val:
# print(val, test, shortlist[test], pcode)
result = test
break
# print(pcode, row[1], result, row[1] == result)
if row[1] == result:
win += 1
else:
lose += 1
print(pcode)
# print(win, lose)
Nicholas Tollervey ~ 27 Nov 2022 9:05 p.m.
Hi Steve, 👋
I'm going to be your mentor for this project, and thank you so much for your work done so far.
I have some initial feedback, and I hope you take the time to update things, push back or ask questions as we refine your project into a state that allows me to give you a final mark:
- I've read the project description, but I'm not quite sure what your project is supposed to do (it feels like you're having fun exploring a dataset - not a bad thing - but I'm not sure I understand what you hope to achieve). This can (and should) become clearer as we refine the work you've done so far. To quote you, "a short way to figure out where someone is [...] can be really useful in lots of contexts", so we need to zone in on such a context that is meaningful to a certain person[s].
- As a UK based person 🇬🇧 I understand some of the terminology in the description. But imagine I'm from elsewhere in the world, and I have no idea what `NHS`, `CCG`, `postcode` or `electoral ward` might mean. Could you please adjust the description to make it clear what's going on to a non-UK resident..? 👍
- Please feel free to edit and adjust the project entries you've made so far. Some of them are missing punctuation (sorry for being so teacher-ish 🎓) and could perhaps be re-organised in Markdown so the story of your project unfolds in a smoother fashion. I hope you agree, it's important to communicate clearly about coding..! 😁 Which nicely segues into....
- I love the description of the parts of a postcode. I'd always wondered. It's nice and clear, and very engaging.
- Your code doesn't include comments. 😄 While I can make sense of it, I would do so quicker and with a greater sense of the context of what you're doing if you annotated your code with descriptions of your intentions. For instance:
# Extract and map CCG ids to postcodes contained therein.
for row in reader:
    if row[1] in results:
        if row[0].split(" ")[0] not in results[row[1]]:
            results[row[1]].append(row[0].split(" ")[0])
    else:
        results[row[1]] = [row[0].split(" ")[0]]
- Also, when iterating over rows, it is possible to unpack each item in the row into a meaningful name (for example, if you always have two values in a row, you can assign each to a named object, as shown below). It makes the code much clearer to read and easier to understand. For example, the fragment above could be re-written as:
results = {}
with open('all.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # Extract and map CCG ids to postcodes contained therein.
    for postcode, ccg in reader:
        outcode, incode = postcode.split(" ")
        if ccg in results:
            if outcode not in results[ccg]:
                results[ccg].append(outcode)
        else:
            results[ccg] = [outcode, ]
- Also, why not use a `set` rather than a `list` as the value into which you store the `outcode`..? Can you see why this would be better..?
- Bravo on reducing a 6M CSV to a 5k text file (although you don't appear to show how you do this, the first time you mention it... see my previous point about re-ordering, editing your entries so they tell the story of the project most effectively).
- Regarding this comment, I'm not sure I quite follow all the code in there (all the nested `for`s and `if`s with no comments). Can you add comments or docs to explain your intention? I'm also not exactly sure what you mean by, "Here I've found the following: 150 outcodes are in more than one region; 272 are unique to regions; 422 all the outcodes. So 64% of regions can be found by the outcode alone."
- The rest of the code has the same sort of issues as above. Please consider adding comments to describe your intention while also writing code where functions and objects are given meaningful names. For instance, at first glance I have no idea what this means (note `all` is a `builtin` function in Python so there is a name collision!):
def op(a, s, indent):
    if indent == 1:
        all = [x for x in results if a in results[x] and len(x) == indent]
    else:
        all = [x for x in results if a in results[x] and len(x) == indent and x.startswith(s)]
- OK. So this comment starts to make me understand what it is you're perhaps trying to achieve. Please remember that comments are Markdown and the URL could be turned into a link like this: `[https://codepen.io/hawkz/pen/qBpKzaL](https://codepen.io/hawkz/pen/qBpKzaL)`, which will look like this: https://codepen.io/hawkz/pen/qBpKzaL. However, I'm a little confused by the simple app at the end of that URL. Please see my annotations on the UI in the image copied below:
- In the final "bumps along the way..." entry, do you need to keep all the commented out code..? You also mention the `[B|CV]` regex bug, but don't say how you fixed it (I have to find it in your code - and I'm not sure I've actually found it). You also mention the `tqdm` package but don't say what it is, what it's for, nor where to find out more information (presumably here). Finally, the five source files are pasted without any description of how you moved from one version to the next and refined your code. Could you please add some commentary..?
So, please don't be disheartened by the feedback. Much of it can be rectified by remembering to add comments describing intention, naming things appropriately so the code reads in a meaningful manner, and structuring your Markdown in a way that the story unfolds. All of which can be easily updated with a few edits. 🚀
I'm also getting an impression of what the final project might look like for an end user. A sort of fake government service where you enter your postcode and the service tells you the details of your CCG (contact details, reports, website etc...). This would be a great project to get hosted online somewhere. Perhaps, once the comments, names and Markdown are fixed, this is something to refine towards... perhaps a super-simple Flask app freely hosted on PythonAnywhere, or similar..?
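Whatever ends up hosting it, the heart of such a service could be one small lookup function. The sketch below is only an assumption about how that might look: the region names and patterns in `REGION_PATTERNS` are invented placeholders (real ones would come from your generator script), and `find_region` is a hypothetical name. It follows the same "longest match wins" idea your verification code already uses:

```python
import re

# Hypothetical patterns for illustration only; not taken from the real data.
REGION_PATTERNS = {
    "region-a": re.compile(r"AB1(?:2|3)"),
    "region-b": re.compile(r"AB14"),
    "region-c": re.compile(r"XY"),
}


def find_region(postcode):
    """Return the region whose pattern matches the longest prefix, or None."""
    compact = postcode.replace(" ", "").upper()
    best_region, best_length = None, 0
    for region, pattern in REGION_PATTERNS.items():
        # re.match anchors at the start of the string, so no '^' is needed.
        match = pattern.match(compact)
        if match and match.end() > best_length:
            best_region, best_length = region, match.end()
    return best_region


print(find_region("AB1 2CD"))
print(find_region("ZZ1 1ZZ"))
```

A web view (Flask or otherwise) would then only need to pass the submitted postcode to this function and render the result, with a friendly message when it returns `None`.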
Once I have some source code for an app like this I can check you're covering all the bases for the core concepts you need for grade three, and can feedback and finally give you your result.
Great work so far Steve, keep up the good work and please remember to refine the things you've done already. As always, if you have any questions or clarifications, please don't hesitate to reply here. I'm here to help and support you as this project matures into something epic. 🤗