"We seem to be getting a lot of signups from Germany" - so said my fellow administrator on the
First Great Western Coffee Shop forum. At first glance that is something of a surprise, as this forum is "provided by a First Great Western Customer, for First Great Western customers" and First Great Western run train services from Paddington to the West of England and South Wales, with a secondary main line from Portsmouth to Cardiff, and regional, local suburban and rural trains on other lines within the same territory, with occasional services venturing as far "off piste" as Brighton. Nowhere near Germany. So why the interest?
Forums provide an opportunity for people to express their views, add their comments to those of others, and post up their information. And as such they can provide a wonderful opportunity for people to get off topic messages onto publicly readable forums on the Internet. My mailbox contains adverts for pharmaceutical products, get-rich-quick schemes, books on Steve Jobs (this week), overseas graduate programs, crocuses, home security systems, dating services, airline tickets and more ... and given half a chance, these same people who, unsolicited, pester me by email would love to advertise on the forum and pester people there too.
To keep the wood visible amongst the trees, we limit signups on "The Coffeeshop" to those people who have a genuine interest, and who will post about the issues for which the forum exists. We still get plenty of requests for signup, but our vetting process is such that very few of the "spammers", or rather wannabe spammers, actually manage to get as far as posting. But it's wasteful of our time, and we're always looking to improve our tools to help us spot the spammers quickly; recently, I added extra logging of signup requests to help us look at them in a "pageview" mode, and we've now come to the reporting requirement: looking at the data that's building up, to keep us even better informed for the future.
So ... the specification for the program and the requirement both look a bit woolly, and I decided to apply some of the techniques of "Extreme Programming" to the task - writing a short story as to what we wanted - "We would like to be able to count up how many spammers come from where so that we can tell which places are the worst / most likely" - and then tackling it through a spike solution where I wrote experimental code to see how an answer would look. I selected Python for the task (an excellent language for the job, and the language I've been teaching this week) ... and off I headed.
Once I start coding, the story turns out to require me to convert data such as:
1 LV Haus finanzieren andrahartwick@gmail.com 91.224.246.15 Thu, 13 Oct 2011 06:26:34 +0100
1 CN cabinet519 zhaominyu15@163.com 113.231.181.142 Thu, 13 Oct 2011 06:26:44 +0100 Shenyang
into results like:
RU 41 Russian Federation
CN 38 China
DE 34 Germany
US 17 United States
UA 16 Ukraine
PL 9 Poland
LV 8 Latvia
etc
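As a rough sketch of that first story (not the code from the spike itself), the two sample records above can be matched with a regular expression and tallied in a dictionary. The field layout in the pattern is my assumption from those two lines: a leading flag, a two letter country code, a user name that may contain spaces, an email address, an IP address, a date and an optional city. The names `RECORD` and `count_by_country` are made up for the illustration, and the sketch uses Python 3.

```python
import re

# The two sample records quoted above, as they appear in the log.
SAMPLE = """\
1 LV Haus finanzieren andrahartwick@gmail.com 91.224.246.15 Thu, 13 Oct 2011 06:26:34 +0100
1 CN cabinet519 zhaominyu15@163.com 113.231.181.142 Thu, 13 Oct 2011 06:26:44 +0100 Shenyang
"""

# Assumed layout: flag, country code, user name (may contain spaces),
# email, IPv4 address, date with timezone offset, optional city.
RECORD = re.compile(
    r"^\s*\d+\s+(?P<country>[A-Z]{2})\s+(?P<user>.*?)\s+(?P<email>\S+@\S+)"
    r"\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
    r"\s+(?P<when>\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})"
    r"\s*(?P<city>.*)$"
)


def count_by_country(lines):
    """Return a dictionary mapping country code to number of signup requests."""
    counts = {}
    for line in lines:
        found = RECORD.match(line)
        if not found:
            continue  # skip anything that does not look like a record
        code = found.group("country")
        counts[code] = counts.get(code, 0) + 1
    return counts


counts = count_by_country(SAMPLE.splitlines())

# Busiest countries first; ties are broken alphabetically by code.
for code in sorted(counts, key=lambda c: (-counts[c], c)):
    print(code, counts[code])
```

Anchoring the pattern on the email and IP address is what lets a user name such as "Haus finanzieren" contain a space without upsetting the other fields.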
and then expand that if necessary (in fact a separate "story") by zone:
CN 38 China
Beijing 18
[unknown] 4
Guangzhou 4
Putian 3
Shenyang 2
Shanghai 2
Jinan 2
Nanjing 1
Wuhan 1
Qingdao 1
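One way to hold the data for that second story is a dictionary of dictionaries: country code to city to count. The sketch below is an illustration under my own assumptions rather than the spike code; `tally`, `report`, the `RECORDS` list and the alphabetical tie-break are all invented for the example, and records with no city are filed under "[unknown]" as in the listing above.

```python
def tally(records):
    """Build {country: {city: count}} from (country, city) pairs."""
    by_country = {}
    for country, city in records:
        cities = by_country.setdefault(country, {})
        place = city or "[unknown]"  # blank city, as in the LV sample line
        cities[place] = cities.get(place, 0) + 1
    return by_country


def report(by_country, names, verbose=False):
    """Return report lines; with verbose=True add the per-city breakdown."""
    lines = []
    totals = {code: sum(cities.values()) for code, cities in by_country.items()}
    for code in sorted(totals, key=lambda c: (-totals[c], c)):
        lines.append(f"{code} {totals[code]} {names.get(code, '')}".rstrip())
        if verbose:
            cities = by_country[code]
            for place in sorted(cities, key=lambda p: (-cities[p], p)):
                lines.append(f"  {place} {cities[place]}")
    return lines


# Hypothetical records, just enough to exercise the code.
RECORDS = [("CN", "Shenyang"), ("CN", "Beijing"), ("CN", ""),
           ("CN", "Beijing"), ("LV", "")]
NAMES = {"CN": "China", "LV": "Latvia"}

report_lines = report(tally(RECORDS), NAMES, verbose=True)
print("\n".join(report_lines))
```

Keeping the city counts inside the country entry means the country total is just the sum of its cities, so the short and the verbose report come from the same structure.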
Now that I have got to that point in my exploration of the data, if I needed more I would be refactoring - taking what I have learned and recoding it to make it maintainable. You can see the code [here] with some quite notable comments pointing out its shortcomings, ready for the refactoring exercise if that ever comes (and if you want to run the program yourself, there's a data sample [here]).
I'm sharing this example on our web site under our "Data Munging in Python" heading - for even in its raw form it's a good example of some of the techniques commonly used ... in the source, you'll find coding samples of:
• Regular Expressions (to match patterns in data and extract from them)
• Command Line handling (we've used a -v option to select the verbose / by city report)
• Dictionaries (to keep count by countries as we read the data file)
• The urllib2 module (to read a web page from a remote server - the ISO country code lookup!)
• Checking whether a file exists (via os.path.exists)
• Routing non-data output to stderr (via sys.stderr)
• lambda (to provide single line functions)
• read (to slurp an entire file into a variable)
• title (to take a country name that's SHOUTED AT YOU and reduce it to more manageable speech!)
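To give a flavour of how several of those pieces sit together, here is a small sketch written for Python 3, where the job done by urllib2 has moved to urllib.request. It is not the program linked above: the function names and the idea of a local cache file are mine, and I have assumed a country listing with one upper case "NAME;CODE" pair per line.

```python
import os
import sys
import urllib.request  # Python 3 home of what urllib2 did in Python 2


def wants_verbose(argv):
    """Command line handling: is -v among the arguments?"""
    return "-v" in argv[1:]


def load_country_text(cache, url):
    """Slurp the listing from a local file if it exists, else from the web."""
    if os.path.exists(cache):
        with open(cache, encoding="utf-8") as handle:
            return handle.read()  # whole file into one string
    # Progress messages go to stderr so they stay out of the report.
    print(f"No cache at {cache}, fetching {url}", file=sys.stderr)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_countries(text):
    """Turn assumed 'NAME;CODE' lines into {code: 'Name In Title Case'}."""
    names = {}
    for line in text.splitlines():
        if ";" not in line:
            continue
        name, code = line.rsplit(";", 1)
        names[code.strip()] = name.strip().title()
    return names


# A tiny in-memory listing in the assumed format, so nothing is downloaded.
LISTING = "RUSSIAN FEDERATION;RU\nCHINA;CN\nGERMANY;DE\n"
names = parse_countries(LISTING)

# lambda as a one line sort key: order the pairs by country name.
for code, name in sorted(names.items(), key=lambda pair: pair[1]):
    print(code, name)
```

In a real run you would pass `sys.argv` to `wants_verbose` and give `load_country_text` whatever file name and URL suit your own setup.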
Truly, so much of the power of any language comes not so much from the power of individual features, but rather from the power of using them in combination, and from researching, refactoring and reusing the code that uses those features.
(written 2011-10-14)