2008 post: NCBI GEO submission: howto hints

Ok, NCBI GEO submission of data can be a pain. I mean a big pain.
But there are a few simple things that can make it less painful.

Here are my hints and a few steps:

1. Don’t assume that you will get the submission right the first time; it’s easy to have errors.
2. DO assume that NCBI will contact you requesting more information on some things. Be ready.
3. DO save all relevant files; as #2 says, you may get contacted.

And importantly:
4. Remember: some of the system’s annoyances exist to ensure that in 5 or 10 years your data will still be comprehensible, as opposed to being stuck in some weird vendor-specific format. So be patient.
5. Put on your resume that you did an NCBI GEO submission. It can’t hurt.

Key hints for making it easier
1. Do all submissions while the people who generated the data are still around. You will be surprised at the little details you need to add that turn out to be unclear.
2. You will need all the files for the experiments – you have to put raw files in as a supplement. So get the files together as much as possible.

The Steps: A Protocol
1. Search GEO for an entry that has the exact same type of data and type of array that you are submitting. This will save you huge amounts of time. You don’t want to have to redefine a platform file: it is annoying, it costs you time and energy, and it makes the system worse.
2. Once you have found that entry, you will have the platform file (the GPL number) for the array type that you are using. Make a clear note of it!
3. (Note: there may be better ways to do this, but this works for me.) Download the sample entry that you found, in full, in SOFT format. The SOFT format makes uploading files way faster and easier.
4. The SOFT format is a plain-text format, and the opening lines are clearly labeled fields. Open the file in a text editor (note: on Windows, download and install Notepad++ to do this; it will save you a lot of pain).
5. Cut out the header (maybe 30 to 50 lines) and put it in a new file. Edit this file with the parameters of your experiment.
6. The hard part is this: you have to make a data file that corresponds to the platform file IDs. This is beyond the scope of this blog post; maybe I will add something about this later.
7. Make a zip file of all the supplementary files (these are the raw data files). I’ll call this SUPP.zip
8. Edit the header to reflect that you are putting in a supplementary file and add the name of this file.
9. Add your header to the top of the datafile (made in step #6). At the end of the datafile, you need an end line. Add this. Save this file. (Again, on Windows, Notepad++ is the way to go for this.) I’ll call this file FORGEO.txt
10. Create a second zip archive (I’ll call it TOTAL.zip) containing:
a. FORGEO.txt
b. SUPP.zip
Note: this means that TOTAL.zip has exactly two files in it (FORGEO.txt and SUPP.zip).
11. Using the validation option, upload ONLY FORGEO.txt to see if it validates. This is important! It will save you a lot of time to do this. You will get an error about a missing supplementary file, but don’t worry about that.
12. Using direct submission, submit TOTAL.zip using the SOFT option. The upload will generally take a long time. You will get a screen asking whether FORGEO.txt or SUPP.zip is the datafile. Choose FORGEO.txt.
13. You are done with one submission!
14. I suggest that you use more informative names than FORGEO.txt, SUPP.zip and TOTAL.zip. I name the files with the array number, for example 85012.txt, 85012_supp.zip and 85012_total.zip.
15. IMPORTANT: if you have a lot of files or just big files, the FTP option is best.
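
The file-wrangling part of the protocol (the end line from step 9, the two zip archives from steps 7 and 10, and the naming from step 14) is easy to script. Here is a minimal Python sketch, not an official GEO tool. The function names assemble_soft and package_submission are made up for illustration, and the default end line "!sample_table_end" is an assumption on my part; check the SOFT file you downloaded in step 3 and use whatever closing line it has.

```python
import zipfile
from pathlib import Path
from typing import Iterable


def assemble_soft(header: str, data_table: str,
                  end_line: str = "!sample_table_end") -> str:
    """Step 9: header first, then the data table, then the end line.

    The default end_line is an assumption; copy the closing line from
    the SOFT file you downloaded if it differs.
    """
    parts = [header.rstrip("\n"), data_table.rstrip("\n"), end_line]
    return "\n".join(parts) + "\n"


def package_submission(array_id: str, soft_text: str,
                       raw_files: Iterable[Path], out_dir: Path) -> Path:
    """Steps 7 and 10: write the SOFT file, zip the raw files, then zip both.

    File names follow the array-number convention from step 14.
    Returns the path of the outer archive (the TOTAL.zip equivalent).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    soft_path = out_dir / f"{array_id}.txt"
    supp_path = out_dir / f"{array_id}_supp.zip"
    total_path = out_dir / f"{array_id}_total.zip"

    soft_path.write_text(soft_text, encoding="utf-8")

    # Inner archive: the raw supplementary files, stored without directories.
    with zipfile.ZipFile(str(supp_path), "w", zipfile.ZIP_DEFLATED) as supp:
        for raw in raw_files:
            supp.write(str(raw), arcname=Path(raw).name)

    # Outer archive: exactly two entries, the SOFT file and the inner zip.
    with zipfile.ZipFile(str(total_path), "w", zipfile.ZIP_DEFLATED) as total:
        total.write(str(soft_path), arcname=soft_path.name)
        total.write(str(supp_path), arcname=supp_path.name)

    return total_path
```

For example, calling package_submission("85012", text, raw_files, Path("out")) would write 85012.txt, 85012_supp.zip and 85012_total.zip into out/. You still want to validate the .txt file on its own (step 11) before uploading the total archive.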
