Python - Big data file: read and create a structured file
I have a 20+ GB dataset structured as follows:
1 3
1 2
2 3
1 4
2 1
3 4
4 2
(Note: the repetition is intentional, and there is no inherent order in either column.)
I want to construct a file in the following format:
1: 2, 3, 4
2: 3, 1
3: 4
4: 2
Here is my problem: I have tried writing scripts in both Python and C++ that load the file, build long strings, and write the output file line by line. It seems, however, that neither language is capable of handling the task at hand. Does anyone have suggestions on how to tackle this problem? Specifically, is there a particular method or program that is optimal for this? Any help or guided directions would be appreciated.
You can try using Hadoop, which can run a MapReduce program in stand-alone mode. The mapper outputs the first column as the key and the second column as the value. All outputs with the same key go to one reducer, so the reducer receives a key and the list of values for that key. It can run through the values list and output (key, valueString), which is the final output you desire. You can start with a simple Hadoop tutorial and write the mapper and reducer as suggested. However, I have not tried scaling to 20 GB of data on a stand-alone Hadoop system, so you may have to experiment. Hope this helps.
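To make the mapper and reducer idea concrete, here is a minimal local sketch in Python. It is not Hadoop code: the function names `mapper`, `reducer`, and `run_local` are made up for illustration, and an in-memory `sorted()` call stands in for the shuffle step that Hadoop would perform between the two phases. It assumes each input line holds exactly two whitespace-separated columns.

```python
from itertools import groupby


def mapper(lines):
    """Emit (key, value) pairs: first column is the key, second is the value."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines (an assumption of this sketch)
        yield parts[0], parts[1]


def reducer(sorted_pairs):
    """Collapse consecutive pairs that share a key into one 'key: v1, v2' line."""
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        values = ", ".join(value for _, value in group)
        yield f"{key}: {values}"


def run_local(lines):
    """Hypothetical driver: sorted() plays the role of Hadoop's shuffle phase.

    The sort is stable and keyed on the first column only, so the values
    for each key keep their input order. Keys are compared as strings.
    """
    pairs = sorted(mapper(lines), key=lambda kv: kv[0])
    return list(reducer(pairs))


if __name__ == "__main__":
    sample = ["1 3", "1 2", "2 3", "1 4", "2 1", "3 4", "4 2"]
    for out_line in run_local(sample):
        print(out_line)
```

The in-memory sort is only suitable for small samples. For the full 20 GB file, the grouping step is the part that Hadoop (or any on-disk sort by the first column) would have to take over; the reducer logic itself only ever needs one key's values at a time.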