Ironically, despite being a mixture of rather obvious commonsense stuff with idiosyncratic practical tips, this one page of tips is probably a more important contribution to humanity than my entire thesis (because the results of my thesis are only applicable to people in a specialized field, whereas many, many people do the sort of data analysis that these tips would help with).
Please note that although i say 'big data' here, what i really mean to address is situations where your analysis subroutines take more than a couple of hours to run, regardless of how much or how little memory/storage space the data takes up.
- You probably expect to spend most of your time analyzing data, but actually you'll spend most, or at least a good part, of your time writing little scripts to convert between one data format and another
- "In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data." -- @BigDataBorat 26 Feb 2013
- More important than the choice of algorithm is the choice of normalization (although different algorithms demand different types of normalization).
- some popular types of normalization (note that some of these conflict with each other, so you must choose; a code sketch follows this list):
- subtract the mean (so that the new mean is 0)
- divide by the std (so that the new std is 1)
- subtract the min (so that the new min is 0)
- divide by the max (so that the new max is 1)
- whiten data (if the data were a multivariate gaussian, this would turn a slanted elipse into a circle)
- throw out data entries which are nan
- instead of throwing out nan entries, replace them with zeros
- instead of throwing out nan entries, replace them with the mean of the other data entries
- when you have multielement data entries like rows, columns, or images: throw out data entries whose standard deviation is below some threshold
- throw out outliers (this is often considered separate from 'normalization' btw)
- for matrices only:
- symmetrize (e.g. replace the matrix with the average of itself and its transpose)
- scale rows and columns so as to make the diagonal all 1s (e.g. this is how a covariance matrix becomes a correlation matrix)
- if someone just says "normalize" and you don't know which kind of normalization they mean, they probably mean whitening, and/or subtracting the mean and dividing by the std
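- for concreteness, here is a minimal NumPy sketch of a few of the normalizations above (variable names are made up; assumes rows are data entries and columns are features):

      import numpy as np

      X = np.random.randn(100, 20)                    # stand-in data: 100 entries, 20 features

      Xc = X - X.mean(axis=0)                         # subtract the mean (new mean is 0)
      Xz = Xc / X.std(axis=0)                         # ...and divide by the std (new std is 1)
      X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))   # rescale each column to [0, 1]

      # whiten: rotate onto the principal axes and scale each axis to unit variance
      # (assumes the covariance matrix is well-conditioned; otherwise nudge evals away from zero)
      evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
      Xw = np.dot(Xc, evecs) / np.sqrt(evals)

      # nan handling: drop rows containing any nan, or fill nans in place
      X_dropped = X[~np.isnan(X).any(axis=1)]
      X_zeroed  = np.where(np.isnan(X), 0.0, X)
      X_meaned  = np.where(np.isnan(X), np.nanmean(X, axis=0), X)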
- It may be useful to try using a data analysis workbench before getting your hands dirty rolling your own; i've heard good things about RapidMiner.
- I like Python, with NumPy (arrays for Python) and SciPy (various miscellaneous data analysis libraries) and matplotlib (plotting)
- on my system, ipython has a weird quirk where if you paste in stuff containing lines that are very long, sometimes the long line or the next line is truncated or mishandled (i forgot exactly what the problem is; i think it only happened if the long line was a string or a comment or something). Also, you can't paste too many lines at once or it truncates. Beware.
- use 'ipython --pylab' to start ipython. This automatically loads matplotlib, which gives you commands like 'plot'. On macs, if you don't do this, you have to do pylab.show() after creating a plot, otherwise nothing happens.
- matplotlib.mlab.find is a useful function
- reload is a useful ipython function
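- quick examples of the above bits, assuming you started 'ipython --pylab' (so plot(), find(), randn() etc. are already in the namespace); the module name is made up:

      x = randn(1000)
      plot(x)                    # with --pylab, no explicit show() call is needed
      idx = find(x > 2)          # indices where the condition holds (matplotlib.mlab.find)
      import my_analysis         # hypothetical module you are editing
      reload(my_analysis)        # pick up your edits without restarting ipython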
- 'from IPython.frontend.terminal.embed import InteractiveShellEmbed' and then 'keyboard = InteractiveShellEmbed()' is useful. Then you can call 'keyboard()' in your code to drop into a 'debug mode' like in Octave or Matlab. But running InteractiveShellEmbed() sometimes screws up ipython's reload so that it deletes old commandline history, so comment out the line 'keyboard = InteractiveShellEmbed()' except when you are actually using it
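- a minimal sketch of that 'keyboard' trick (the import path is the one given above; newer IPython versions have moved it, and the function here is made up):

      from IPython.frontend.terminal.embed import InteractiveShellEmbed
      keyboard = InteractiveShellEmbed()     # comment this line out when not debugging

      def analyze(data):
          intermediate = data * 2            # stand-in for a real computation
          keyboard()                         # drops you into an interactive shell here, with
                                             # 'data' and 'intermediate' in scope; Ctrl-D to continue
          return intermediate.sum()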
- cPickle is useful; but if all you are saving is a set of named scalar numbers, vectors, and matrices, use scipy.io.matlab.savemat and scipy.io.matlab.loadmat instead; the files take less space and save/load faster for large matrices
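- a sketch of the savemat/loadmat round trip (filename and variable names are made up; note that loadmat wraps scalars in arrays):

      import numpy as np
      from scipy.io.matlab import savemat, loadmat

      results = {'weights': np.random.randn(500, 500), 'threshold': 0.3}
      savemat('weights_test_130226.mat', results)
      loaded = loadmat('weights_test_130226.mat')
      weights = loaded['weights']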
- copy.deepcopy is useful
- i have a library called bshanks_scipy_misc with various useful functions, most usefully 'imss', which plots a matrix similarly to imshow, but such that when you mouseover a pixel it tells you the exact location and value of that pixel in the corner of the plot window. imss is composed of extensible parts, so that you can look at its source and see how to make it do other stuff, too.
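- roughly the same mouseover effect can be had with matplotlib's format_coord hook; here is a sketch of the idea (not the actual imss source):

      import numpy as np
      import matplotlib.pyplot as plt

      def imshow_with_values(m):
          fig, ax = plt.subplots()
          ax.imshow(m, interpolation='nearest')
          def format_coord(x, y):
              # show row, column, and value under the cursor in the plot window's status bar
              col, row = int(x + 0.5), int(y + 0.5)
              if 0 <= row < m.shape[0] and 0 <= col < m.shape[1]:
                  return 'row=%d col=%d value=%g' % (row, col, m[row, col])
              return 'row=%d col=%d' % (row, col)
          ax.format_coord = format_coord
          return ax

      imshow_with_values(np.random.rand(20, 20))
      plt.show()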
- Keep an informal 'lab journal' (a made-up example entry is sketched after this list):
- have a plaintext file
- keep it open in an editor while you work
- use the lab journal to write down your comments, but more importantly, bits of code that you used to analyze stuff
- to separate entries in the file, put a few empty lines, then '--' on a line of its own, then a few empty lines; put the newest entry on top, rather than at the bottom (i.e. the journal is in reverse chronological order)
- whenever you save something to a file, make an entry in the journal that shows exactly what steps you used to produce that thing, starting with the result of one of the last few journal entries.
- Since this isn't a formal lab journal, you can and should edit the steps to make them cleaner (remove mistaken steps); but if you do so, you should close your ipython session, start a new one, and paste in the journal entry to execute it, to make sure that your editing didn't introduce an error. Use ipython's %cpaste to paste it in (yeah this is very annoying)
- much of the time, you should be writing code in the journal and then pasting it into ipython, to save yourself the hassle of copying it into the journal afterwards and then testing the copy. However, when you are fooling around and expect to make lots of mistakes, you can work in ipython and copy to the journal afterwards.
- make sure that the date appears at least every few entries, either because you wrote it there just for reference, or because you had a command to save a data file which included the date in the filename, and you manually entered the date part
- most of the time, you only want copy-and-pastable code in the journal, not the results printed out by ipython. But sometimes you want the results too. When easy, either put the results in their own journal entry, or put them in a format (e.g. with lines prefixed by #) such that you can copy and paste everything in the journal entry.
- put a few empty lines at the beginning and end of each journal entry, and between any parts that you would copy and paste separately. This makes it faster to copy and paste because you don't have to 'aim' so precisely (see Fitts's law)
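- a made-up example of what one journal entry following these conventions might look like (filenames, numbers, and functions here are all hypothetical):

      --


      # 130226: recompute the correlation matrix from the whitened data
      # (starts from the result of the entry below that produced data_whitened_130225.mat)

      d = loadmat('data_whitened_130225.mat')
      C = corrcoef(d['X'], rowvar=0)
      savemat('corr_fromwhitened_130226.mat', {'C': C})

      # result: C.shape == (90, 90)


      --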
- Keep your code under version control. I like git.
- when saving data files, put the dataset source, the choices of methods used to generate the data, and the parameters used, in the filename, separated by underscores. Somewhere in there (i try to put it at the end), put the date that the file was generated (i use the format YYMMDD).
- besides just having that older data around, a major benefit of having the data in the filename is that you'll be able to quickly find the lab journal entries that correspond to a given data set, and vice versa
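- a made-up example of building such a filename in code (the dataset name, method, and parameters here are just placeholders):

      import datetime
      import numpy as np
      from scipy.io.matlab import savemat

      labels = np.zeros(100)                                   # stand-in for real results
      datestr = datetime.date.today().strftime('%y%m%d')       # YYMMDD, e.g. '130226'
      fname = 'mousecortex_kmeans_k20_whitened_%s.mat' % datestr
      savemat(fname, {'labels': labels})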
- Keep data files from older versions/attempts around. You never know when you will find that the stuff you have been doing for the last month is giving results that are in some way worse than older results, at which time you'll want to compare exactly what the differences were with the older intermediate data.
- For things that take a long time to run, don't just write an in-memory pipeline; save the results of each step to a (different) file. That is good because (a) when you modify something, you can start with the result of the previous step, rather than doing redundant re-computation; (b) if something crashes, you can pick up right before the part that crashed.
- Whenever you notice that a step produced nonsense because you fed it invalid input, you MUST put in an assertion to check for that type of invalid input (if possible). This is especially true if the step runs for longer when fed nonsense. Otherwise, someday, you will find that you run some pipeline for two weeks only to get a nonsensical result because you forgot to check the input for that condition.
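- for example, a cheap check of this sort might look like the following (the function and messages are hypothetical):

      import numpy as np

      def cluster_columns(X, k):
          # fail fast, before hours of computation, rather than silently produce nonsense
          assert not np.isnan(X).any(), 'input contains nans; clean/normalize it first'
          assert X.std() > 0, 'input has zero variance; wrong file or wrong column?'
          assert k <= X.shape[1], 'asking for more clusters than there are columns'
          # ... the actual long-running computation goes here ...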
- When you find that you are making a lot of journal entries with repeated code, create a new subroutine in your code for those steps
- In your code, distinguish functions that operate on in-memory data passed as arguments and never touch the disk, from functions that are given filenames, load stuff from disk, and save intermediate or final results to disk. I do this by having the names of the latter kind of function end in '_pipeline'. All or almost all of the actual computation, the 'meat' of the _pipeline functions, should be accomplished by calling other, non-pipeline functions (a sketch follows a couple of entries below).
- At the end of your project, you should have a linear sequence of _pipeline functions to call that reproduce your entire analysis from the initial data (this might take weeks to run). The only reason you should have multiple _pipeline functions instead of just one that does the whole project is when a human must manually look at the result of a previous computation and then specify a parameter for a subsequent computation.
- Whenever you find that you have stopped modifying the parts of the analysis before a certain point, and are just using the same intermediate data result over and over, make sure that you will remember exactly how you got that intermediate result later. The best thing to do is to create a sequence of _pipeline functions that get you to that result from the beginning (if you have already been doing this, you can start with the results of the previous _pipeline function and only have to add a new _pipeline function for the new steps). If you don't want to spend time on that right now, at least make sure that there is a journal entry or sequence of journal entries that you can paste into ipython to get the result, and that you have prominently labeled those entries so that you can quickly find them again later, distinguish them from similar-looking false starts which are also written down in the journal, and tell what the correct order to run them in is.
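- a minimal sketch of the _pipeline convention (function names, filenames, and the whitening step are all just for illustration):

      import numpy as np
      from scipy.io.matlab import savemat, loadmat

      def whiten(X):
          # in-memory function: arguments in, values out, no disk I/O
          Xc = X - X.mean(axis=0)
          evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
          return np.dot(Xc, evecs) / np.sqrt(evals)

      def whiten_pipeline(infile, outfile):
          # disk-facing wrapper: load, delegate the real work, save a checkpoint
          X = loadmat(infile)['X']
          savemat(outfile, {'X': whiten(X)})

      # the whole analysis should end up as a short linear script of such calls, e.g.:
      # whiten_pipeline('raw_130220.mat', 'whitened_130226.mat')
      # cluster_pipeline('whitened_130226.mat', 'clusters_k20_130226.mat')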
- Don't prematurely optimize.
- If you want to make something a little faster than ordinary Python without rewriting it in C, try Cython. Just by annotating the types of some variables in inner loops, and by turning off bounds checking, you can get significant speedup. You can use 'import pyximport; pyximport.install(); ' to autocompile. But if you use pyximport, you must quit and restart ipython to get it to recompile after you change the source.
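- a rough sketch of the sort of annotation meant here; the contents of a hypothetical fastloop.pyx might be:

      # cython: boundscheck=False
      import numpy as np
      cimport numpy as np

      def row_sums(np.ndarray[np.float64_t, ndim=2] X):
          cdef Py_ssize_t i, j
          cdef np.ndarray[np.float64_t, ndim=1] out = np.zeros(X.shape[0])
          for i in range(X.shape[0]):
              for j in range(X.shape[1]):
                  out[i] += X[i, j]
          return out

      # then, from ipython:
      #   import pyximport; pyximport.install()
      #   from fastloop import row_sums      # compiles fastloop.pyx on first import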
- Before getting fancy, try the basic stuff provided in some SciPy library. E.g. means, standard deviations, SVMs for supervised learning, PCA and its non-square cousin SVD for dimensionality reduction, k-means and/or hcluster for clustering, correlation for supervised feature selection.
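- quick sketches of a few of those basic tools on made-up data (SVMs live in scikit-learn and aren't shown here):

      import numpy as np
      from scipy.cluster.vq import kmeans2, whiten

      X = np.random.randn(200, 10)                    # stand-in data: 200 samples, 10 features
      y = X[:, 0] + 0.1 * np.random.randn(200)        # stand-in target variable

      means, stds = X.mean(axis=0), X.std(axis=0)     # means and standard deviations

      # PCA via SVD: principal directions are the rows of Vt; project onto the first 2
      U, s, Vt = np.linalg.svd(X - means, full_matrices=False)
      scores = np.dot(X - means, Vt[:2].T)

      centroids, labels = kmeans2(whiten(X), 5)       # k-means clustering into 5 groups

      # correlation of each feature with the target, for crude supervised feature selection
      corrs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]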
- More important than the choice of algorithm is the choice of what sort of prior knowledge you are providing to the algorithm. But the choice of algorithm is important too.
- often your algorithms will work better if you give them un-simplified input. For example, a human would prefer to look at a hard clustering (partitioning) rather than a soft clustering. A human prefers to think in terms of categorical or even binary features rather than graded. A human would prefer to look at 20 features each of which is distinct rather than 90 features, many of which look similar. So, for humans, you might want to partition, and you might want to do feature selection. But many algorithms give better results if you give them as input things like soft clusterings, graded inputs, and many features.
- if you are doing prediction, for most applications overfitting is a big problem, so it's often better to select a model that you know is not expressive enough to get the right answer even in theory, e.g. to select a linear model to predict a non-linear situation.
- think about what evidence you will provide to convince your peers that your conclusions are correct. Consider making this evidence your objective function, or otherwise restructuring your algorithm around this sort of evidence.
- having multiple sources of data is crucial for controlling artifacts and noise
- there is no substitute for prior knowledge of what kind of artifacts occur in your data set