The consistency of values in the extracted data is unknown. It depends on the fields and value parameters chosen in NCBI and on the constraints applied by the authors of those data.
The completeness of the extracted data is unknown. It depends on the fields and value parameters chosen in NCBI and on the constraints applied by the authors of those data. A manual check of the extracted data is highly recommended to ensure completeness.
Source_Information:
Source_Citation:
Citation_Information:
Originator: National Center for Biotechnology Information
Publication_Date: 1982
Title: GenBank
Geospatial_Data_Presentation_Form: website
Publication_Information:
Publication_Place: Bethesda, MD
Publisher:
National Center for Biotechnology Information, U.S. National Library of Medicine
Online_Linkage: http://www.ncbi.nlm.nih.gov/nuccore/
Type_of_Source_Media: database
Source_Time_Period_of_Content:
Time_Period_Information:
Single_Date/Time:
Calendar_Date: 201703
Source_Currentness_Reference: publication date
Source_Citation_Abbreviation: NCBI
Source_Contribution:
The National Center for Biotechnology Information Nucleotide database is a repository for genetic sequences and associated information.
Data obtained consist primarily of sequence identifiers (usually the Genus species name) and sequence data (nucleotides).
Source_Information:
Source_Citation:
Citation_Information:
Originator: Python Community
Publication_Date: 199505
Title: Python
Edition: 2.7
Geospatial_Data_Presentation_Form: software library
Online_Linkage: https://www.python.org/
Type_of_Source_Media: computer program
Source_Time_Period_of_Content:
Time_Period_Information:
Single_Date/Time:
Calendar_Date: 20160217
Source_Currentness_Reference: publication date
Source_Citation_Abbreviation: Python
Source_Contribution:
Code libraries on which the bioinformatics scripts are built and on which they depend for execution.
Source_Information:
Source_Citation:
Citation_Information:
Originator: Biopython
Publication_Date: 2000
Title: Biopython
Edition: 1.68
Geospatial_Data_Presentation_Form: software library
Online_Linkage: http://biopython.org/wiki/Biopython
Type_of_Source_Media: computer program
Source_Time_Period_of_Content:
Time_Period_Information:
Single_Date/Time:
Calendar_Date: 20160825
Source_Currentness_Reference: publication date
Source_Citation_Abbreviation: Biopython
Source_Contribution:
Biopython provides additional code libraries that supplement the Python code libraries and extend the functionality of the Python scripts.
Process_Step:
Process_Description:
checkGenBank.py
Description: Searches the National Center for Biotechnology Information GenBank database using two different user developed text (.txt) files. One text file contains a list of Genus species names and a different text file (.txt) contains the search terms (usually genes). Outputs a .csv file containing the Genus species name, search term, and number of hits in separate columns for every possible combination. Also outputs two .txt files, one of all successful searches and one of all failed searches. If the Genus species file contains both Genus species name and common name, they must be separated by a comma.
Process_Date: 201703
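The pairing of names and search terms described for checkGenBank.py can be sketched as follows. This is a minimal illustration written for Python 3, not the script itself: the function names and the CSV column headers are hypothetical, and count_hits is a placeholder callable standing in for whatever GenBank query the script performs.

```python
import csv
import io
from itertools import product


def read_species(text):
    """Return the Genus species name from each non-blank line.

    Anything after the first comma (for example a common name) is dropped.
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            names.append(line.split(",")[0].strip())
    return names


def build_rows(species, terms, count_hits):
    """Pair every species with every search term and record the hit count.

    count_hits is a hypothetical callable (name, term) -> int that stands in
    for the actual database query.
    """
    return [(name, term, count_hits(name, term))
            for name, term in product(species, terms)]


def to_csv(rows):
    """Serialize rows to CSV text with an assumed three-column header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["species", "search_term", "hits"])
    writer.writerows(rows)
    return buffer.getvalue()
```

With two species and two search terms, build_rows returns four rows, one for each combination.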
Process_Step:
Process_Description:
convertFASTQtoFASTA.py
Description: Converts a .fastq file to a .fasta file.
Process_Date: 201703
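The conversion performed by convertFASTQtoFASTA.py can be illustrated with the sketch below. It uses only plain Python rather than Biopython, and it assumes the common four-line FASTQ record layout (header, sequence, separator, quality). The function name is hypothetical and the script's actual implementation may differ.

```python
def fastq_to_fasta(fastq_text):
    """Convert FASTQ-formatted text to FASTA-formatted text.

    Assumes four lines per record; multi-line FASTQ records are not handled.
    Quality scores are discarded because FASTA has no place for them.
    """
    lines = [line.strip() for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence, separator = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near non-blank line %d" % (i + 1))
        out.append(">" + header[1:])  # '@id' becomes '>id'
        out.append(sequence)
    return "\n".join(out) + ("\n" if out else "")
```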
Process_Step:
Process_Description:
convertFASTQtoFASTAfolder.py
Description: Converts all .fastq files in this folder to .fasta files.
Process_Date: 201703
Process_Step:
Process_Description:
cropFASTA.py
Description: Opens a .fasta file, requests the primers of interest, searches the .fasta file for those primers, and outputs one .fasta file with all sequences that contain them.
Process_Date: 201703
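A primer-based selection of this kind could look like the sketch below. It assumes exact, case-insensitive substring matching and keeps a sequence only when every supplied primer is present. Both are assumptions, as are the function names; the script may match primers differently.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def keep_with_primers(records, primers):
    """Keep records whose sequence contains every primer (exact match)."""
    primers = [p.upper() for p in primers]
    return [(h, s) for h, s in records
            if all(p in s.upper() for p in primers)]
```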
Process_Step:
Process_Description:
filterFASTA.py
Description: Checks a .fasta file for identical sequences and taxonomic information. If identical sequences are found that do not have identical taxonomic information, those data are written to a text file ending with _rnw.txt (repeats not written). If identical sequences are found that do have identical taxonomic information, one sequence with its associated taxonomic information is written to a .fasta file ending with _rr.fasta (repeats written).
Process_Date: 201703
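The filtering rule described for filterFASTA.py can be sketched as a grouping step. The example below assumes that the whole FASTA header serves as the taxonomic information and that sequences are compared case-insensitively. The function name is hypothetical, and writing the two output files is left out.

```python
from collections import OrderedDict


def split_repeats(records):
    """Split (header, sequence) records into kept and conflicting groups.

    kept:        one (header, sequence) per sequence whose headers all agree
    conflicting: (sequence, headers) for sequences with differing headers
    """
    by_sequence = OrderedDict()
    for header, sequence in records:
        by_sequence.setdefault(sequence.upper(), []).append(header)

    kept, conflicting = [], []
    for sequence, headers in by_sequence.items():
        if len(set(headers)) == 1:
            kept.append((headers[0], sequence))
        else:
            conflicting.append((sequence, headers))
    return kept, conflicting
```

In this sketch, kept corresponds to the content of the _rr.fasta output and conflicting to the content of the _rnw.txt output.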
Process_Step:
Process_Description:
filterFASTAfolder.py
Description: Checks all .fasta files in the current folder and, for each .fasta file individually, removes all identical sequences that have more than one unique Genus species identifier. Outputs one .fasta file containing only unique sequences with unique Genus species identifiers for each .fasta file. Also outputs a .txt file containing all removed sequences for each .fasta file.
Process_Date: 201703
Process_Step:
Process_Description:
findPrimers.py
Description: Using an aligned .fasta file, searches for consensus sequences (segments) of a predetermined minimum length (can be user modified, default is 18). Segments whose length is greater than or equal to the predetermined length are considered potential primer sites and are written to two .txt files. One .txt file contains just the segments; the other contains the segments and their position numbers in the sequence. Note: this will not find degenerate nucleotides! A manual search of the aligned .fasta file should be conducted after running this script in order to find potential segments containing degenerate nucleotides.
Process_Date: 201703
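The consensus search can be sketched as a scan over alignment columns. The example below treats a column as conserved only when every sequence carries the same unambiguous nucleotide (A, C, G, or T), and reports 1-based start positions. The function name and both conventions are assumptions, not details taken from the script.

```python
def find_conserved_segments(aligned, min_length=18):
    """Return (start_position, segment) for fully conserved runs.

    aligned is a list of equal-length aligned sequences. Gaps and
    degenerate codes break a run, so such sites are never reported.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")

    seqs = [s.upper() for s in aligned]
    segments = []
    start = None
    for i in range(length + 1):  # one extra step closes a run at the end
        conserved = False
        if i < length:
            column = set(s[i] for s in seqs)
            conserved = len(column) == 1 and column <= set("ACGT")
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                segments.append((start + 1, seqs[0][start:i]))
            start = None
    return segments
```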
Process_Step:
Process_Description:
getFASTA.py
Description: Searches the National Center for Biotechnology Information GenBank database using one text file (.txt) containing a list of Genus species names. User-defined inputs include the desired output file format (.gb or .fasta), the number of output files (all data in one file or separate files for each Genus species name), and search term(s) (usually genes) that utilize Boolean operators. Internally stores only unique Accession numbers and outputs all data in either GenBank (.gb) or FASTA (.fasta) format. If the Genus species file contains both Genus species name and common name, they must be separated by a comma.
Process_Date: 201703
Process_Step:
Process_Description:
getFragments.py
Description: Cuts an aligned .fasta file into fragments using a list of potential primer sites and returns all potential fragment combinations in .fasta format. This script assumes the required length of potential fragments is no shorter than 41 bp and no longer than 350 bp. These values can be changed by the user. The potential primer sites input file must be in .txt format.
Process_Date: 201703
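The length-bounded fragment enumeration can be illustrated as follows. The sketch works on a single sequence and takes primer sites as 0-based, half-open (start, end) intervals. It defines a fragment as the region between two sites, excluding the sites themselves. These conventions and the function name are assumptions; the script may define fragments and positions differently.

```python
from itertools import combinations


def candidate_fragments(sequence, sites, min_len=41, max_len=350):
    """Return (start, end, fragment) for each usable pair of primer sites.

    A pair is usable when the sites do not overlap and the region between
    them is between min_len and max_len bases long, inclusive.
    """
    fragments = []
    for (start_a, end_a), (start_b, end_b) in combinations(sorted(sites), 2):
        if start_b < end_a:
            continue  # overlapping sites cannot flank a fragment
        fragment = sequence[end_a:start_b]
        if min_len <= len(fragment) <= max_len:
            fragments.append((end_a, start_b, fragment))
    return fragments
```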
Process_Step:
Process_Description:
matchRefDB.py
Description: Opens a .fasta file to be used as the reference database and searches each .fasta file in the folder individually for sequence matches to the reference database. Outputs a .csv file with the number of matches to reference sequences and one .fasta file with all sequences that did not match any sequence within the reference database.
Process_Date: 201703
Process_Step:
Process_Description:
mergePairedEndFASTAs.py
Description: Searches all files in the current folder and merges forward (R1) and reverse (R2) Illumina MiSeq reads using a user-adjustable global alignment.
Process_Date: 201703
Process_Step:
Process_Description:
verifyFASTA.py
Description: Checks a .fasta file, counts the number of times each sequence is repeated, and writes the sequence, the number of times that sequence occurs, and the identifiers associated with the sequence. Only writes information for sequences that have two (2) or more identifiers. Best used after filterFASTA.py to verify that all remaining sequences have only one (1) unique identifier.
Process_Date: 201703
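The verification report can be sketched as a count over grouped sequences. The example below interprets "two or more identifiers" as two or more distinct identifiers, which is an assumption, and uses a hypothetical function name. An empty result would indicate that every sequence has a single unique identifier.

```python
from collections import OrderedDict


def report_shared_sequences(records):
    """Return (sequence, occurrences, identifiers) for shared sequences.

    records is a list of (identifier, sequence) tuples. Only sequences
    associated with two or more distinct identifiers are reported.
    """
    groups = OrderedDict()
    for identifier, sequence in records:
        groups.setdefault(sequence.upper(), []).append(identifier)

    report = []
    for sequence, identifiers in groups.items():
        unique = sorted(set(identifiers))
        if len(unique) >= 2:
            report.append((sequence, len(identifiers), unique))
    return report
```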
Process_Step:
Process_Description:
verifyFASTAfolder.py
Description: Checks all .fasta files in the current folder individually, counts the number of times each sequence is repeated, and writes the sequence, the number of times that sequence occurs, and the identifiers associated with the sequence. Only writes information for sequences that have two (2) or more identifiers. Best used after filterFASTA.py to verify that all remaining sequences have only one (1) unique identifier.
Process_Date: 201703