*Data manipulation for graphs for Cross validated blog post. *Any questions or problems feel free to contact me, apwheele@gmail[dot]com. *this was run under several SPSS versions (16,17,& 19!), so it should work for awhile, also ran under Window's XP and Window's 7. *where I extracted the SO xml tables to, note I save everything in the same place. *FILE HANDLE DATA /NAME = "YOUR FOLDER FOR THE DATA HERE". *SPSS does not have the native ability to read xml data, but the text data is pretty easily parsed in SPSS. *actually given the size of the files, SPSS works pretty well for the job, although it is a good overnight job due to the size of the files. ***********************************************************. *READING IN THE POST DATA AND GETTING THE INFO. *this takes awhile. GET DATA /TYPE=TXT /FILE='DATA\posts.xml' /DELCASE=LINE /DELIMITERS="" /ARRANGEMENT=DELIMITED /FIRSTCASE=3 /IMPORTCASE=ALL /VARIABLES= V1 A4666. CACHE. EXECUTE. DATASET NAME posts. *If you dont want to do the whole run at once, I would advise saving here, it takes much. *less time to read the SPSS dataset than to import the text/xml. *save outfile = "DATA\posts_orig.sav". *what does CACHE do?, should I not specify CACHE?. *when I first did the run, I Saved this file and it was over 9 gigabytes! (the original xml doc is 6.2 gig). *Initially I processed the string to eliminate records, but I believe I would have been better off extracting the data I needed first and then selecting out the records. *the most recent id is in the millions. compute ID_loc = Char.Index(V1," Id="). string ID_A (A7). compute ID_A = CHAR.SUBSTR(V1,(ID_loc + 5),7). *Jon skeets highest upvote is over 4000, I assume score wont be any more than 5 digits. compute score_loc = Char.Index(V1,' Score="'). string score_A (A5). compute score_A = CHAR.SUBSTR(V1,(score_loc + 8),5). *the highest user id was in the hundred thousands. compute owner_loc = Char.Index(V1,' OwnerUserId="'). string owner_A (A6). compute owner_A = CHAR.SUBSTR(V1,(owner_loc + 14),6). *this is a flag for whether I need the info or not. compute cw_flag = 0. if CHAR.INDEX(V1,'CommunityOwnedDate') > 0 cw_flag = 1. compute answer_flag = 1. if CHAR.INDEX(V1,'PostTypeId="2"') > 0 answer_flag = 1. execute. *lets make the variables in nicer number formats (and get rid of extraneous text). compute ID_trim = Char.Index(ID_A,'"'). if ID_trim <= 0 ID_trim = 7. string ID_2 (A7). compute ID_2 = CHAR.SUBSTR(ID_A,1,(ID_trim-1)). compute post_id = NUMBER(ID_2,F7.0). formats post_id (F7.0). execute. compute score_trim = Char.Index(score_A,'"'). if score_trim <= 0 score_trim = 5. string score_2 (A5). compute score_2 = CHAR.SUBSTR(score_A,1,(score_trim-1)). compute score = NUMBER(score_2,F4.0). formats score (F4.0). execute. compute owner_trim = Char.Index(owner_A,'"'). if owner_trim <= 0 owner_trim = 6. compute OwnerUserId = NUMBER(CHAR.SUBSTR(owner_A,1,(owner_trim-1)),F6.0). formats OwnerUserId (F6.0). execute. save outfile = 'DATA\posts_edit1.sav' /KEEP post_id score OwnerUserId cw_flag answer_flag /COMPRESSED. dataset close ALL. get file = 'DATA\posts_edit1.sav'. dataset name posts2. select if cw_flag = 0 and answer_flag = 1. execute. save outfile = 'DATA\posts_edit2.sav' /KEEP post_id score OwnerUserId /COMPRESSED. **********************************************************************. *READING IN THE USER DATA. *closing the old dataset. dataset close ALL. *I'm only going to import the first 100 characters, I dont need any other info. GET DATA /TYPE=TXT /FILE='DATA\users.xml' /DELCASE=LINE /DELIMITERS="" /ARRANGEMENT=DELIMITED /FIRSTCASE=3 /IMPORTCASE=ALL /VARIABLES= V1 A100. CACHE. EXECUTE. DATASET NAME Users. *now parsing the file. compute ID_loc = Char.Index(V1," Id="). compute Rep_loc = Char.Index(V1," Reputation="). compute create_loc = Char.Index(V1," CreationDate="). execute. string ID_A (A6). compute ID_A = CHAR.SUBSTR(V1,(ID_loc + 5),(Rep_loc - (ID_loc + 6))). string Rep_A (A6). compute Rep_A = CHAR.SUBSTR(V1,(Rep_loc + 13),(create_loc - (rep_loc + 14))). execute. match files file = * /keep ID_A Rep_A. execute. *making formatted variables for these. compute OwnerUserId = NUMBER(ID_A,F6.0). formats OwnerUserId (F8.0). compute current_rep = NUMBER(Rep_A,F6.0). formats current_rep (F6.0). execute. match files file = * /keep OwnerUserId current_rep. execute. *I can get rid of the tags on the end. select if $casenum < 644813. execute. SAVE OUTFILE = 'DATA\users.sav' /COMPRESSED. *this file size is not so bad, only 12,300 KB, I could even get rid of creation date, I dont really need it. **************************************************************************. *MAKING THE DATA USED IN THE FIRST GRAPH, current rep vs score on most recent answer. dataset close ALL. *getting the posts file. GET FILE = "DATA\posts_edit2.sav". DATASET NAME posts WINDOW=FRONT. *this dataset only contains the post id, the owner id, and the score on the post. *this dataset only includes non community wiki answers (see other file for how I made this file and trimmed the records). *selecting the most recent post. sort cases by OwnerUserId (A) post_id (D). compute dup = 0. if OwnerUserId = lag(OwnerUserID) dup = 1. if MISSING(OwnerUserId) = 1 dup = 1. select if dup = 0. match files file = * /drop dup. execute. *now to get the users dataset. get file = "DATA\users.sav". dataset name users WINDOW=FRONT. sort cases by OwnerUserId. execute. match files file = * /table = 'posts' /in recent_post /by OwnerUserId. execute. dataset close posts. select if recent_post = 1. execute. *now to export to csv file that I used to work with in R. SAVE TRANSLATE OUTFILE='DATA\recentpost_and_rep.csv' /TYPE=CSV /MAP /REPLACE /FIELDNAMES /CELLS=VALUES /KEEP current_rep score. ****************************************************************. *MAKING THE CSV FILE USED FOR GRAPH 2, HIGH REP USERS COUNTER. Dataset close ALL. get file = "DATA\users.sav". *here I defined high rep users as those having a current_reputation over 50,000. compute high_score = 0. if current_rep >=50000 high_score = 1. select if high_score = 1. match files file = * /keep OwnerUserId. execute. sort cases by OwnerUserId. dataset name high_rep. *now to merge that file into the posts file. GET FILE = "DATA\posts_edit2.sav". DATASET NAME posts WINDOW=FRONT. sort cases by OwnerUserId. match files file = * /table = 'high_rep' /in flag /by OwnerUserId. dataset close high_rep. *making flag for Jon Skeet. compute Jon_Skeet = 0. if OwnerUserId = 22656 Jon_Skeet = 1. *selecting out posts that are not high rep users. select if flag = 1. execute. formats Jon_Skeet (F1.0). sort cases by OwnerUserId post_id (A). execute. *I do this to make a post counter later. compute case_num = $casenum. execute. *now aggregating the minimum value for case_num by person, then all I have to do is substract that from case_num and add 1 and I will have my counter, also calculate the mean score for answer per person. AGGREGATE /OUTFILE = * MODE=ADDVARIABLES /BREAK = OwnerUserId /case_min = MIN(case_num) /mean_score = MEAN(score). execute. compute post_count = case_num - case_min + 1. execute. formats post_count (F6.0). compute score_dev = score - mean_score. execute. *now to save this as a csv file. SAVE TRANSLATE OUTFILE='DATA\top_scorer_posts.csv' /TYPE=CSV /MAP /REPLACE /FIELDNAMES /CELLS=VALUES /KEEP post_count score_dev Jon_Skeet.