How can I extract all the IUPAC names in the data available from PubChem (NCBI) into a text file?
I want to build lists of prefixes and suffixes of some length from all the IUPAC names in the PubChem database, so that I can use them further in my project as a feature. So I want all the IUPAC chemical names in a text file, or in some format from which I can extract these lists.
Thanks.
Sounds like you need something like the NIST species list.
You can also look up most of them in the WebBook, but I could not find a download link for the complete set.
In our lab we got a CD(?) with the mass spectral database, which contained the database (complete? well, it had about 250,000 substances) as a text file. Maybe you can get it through one of the vendors.
The PubChem site lets you download a dump of its data by FTP. Why not use that?
PubChem data can be downloaded via ftp from the PubChem site. A complete description of the available data can be obtained here: https://pubchemdocs.ncbi.nlm.nih.gov/downloads
For IUPAC names specifically, the data can be downloaded from the "Compound Extras" section of the FTP site: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/
The README-Extras file in this location describes the data in detail. For the IUPAC names, the following information is provided:
CID-IUPAC.gz:
This is a listing of all CIDs with their computed IUPAC names. It is a gzipped text file with CID, tab, IUPAC on each line. Note that the names may contain UTF8 characters.
A download today (23-Apr-2020) contains 102,586,778 rows. An excerpt of the information is shown below.
> head CID-IUPAC
1 3-acetyloxy-4-(trimethylazaniumyl)butanoate
2 (2-acetyloxy-3-carboxypropyl)-trimethylazanium
3 5,6-dihydroxycyclohexa-1,3-diene-1-carboxylic acid
4 1-aminopropan-2-ol
5 (3-amino-2-oxopropyl) dihydrogen phosphate
6 1-chloro-2,4-dinitrobenzene
7 9-ethylpurin-6-amine
8 2,3-dihydroxy-3-methylpentanoic acid
9 (2,3,4,5,6-pentahydroxycyclohexyl) dihydrogen phosphate
11 1,2-dichloroethane
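Given that layout (CID, tab, IUPAC name on each line), the prefix and suffix lists asked for in the question could be built roughly as follows. This is only a sketch: the function names are made up for illustration, the affix length of 3 is arbitrary, and it assumes CID-IUPAC.gz has already been downloaded into the working directory. The file is read line by line, since with over 100 million rows you probably do not want to load it all at once.

```python
import gzip
from collections import Counter


def parse_cid_iupac(lines):
    """Yield (cid, name) pairs from 'CID<TAB>IUPAC' lines, skipping malformed rows."""
    for line in lines:
        cid, sep, name = line.rstrip("\r\n").partition("\t")
        if sep and cid.isdigit() and name:
            yield int(cid), name


def affix_counts(names, length=3):
    """Count prefixes and suffixes of a fixed length over an iterable of names."""
    prefixes, suffixes = Counter(), Counter()
    for name in names:
        if len(name) >= length:  # names shorter than the affix length are ignored
            prefixes[name[:length]] += 1
            suffixes[name[-length:]] += 1
    return prefixes, suffixes


def affixes_from_file(path="CID-IUPAC.gz", length=3):
    """Stream the gzipped file so the whole listing is never held in memory.

    The default path is an assumption: the file downloaded to the current directory.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        names = (name for _, name in parse_cid_iupac(handle))
        return affix_counts(names, length)


# Small in-memory check using three rows from the excerpt above.
sample = [
    "4\t1-aminopropan-2-ol\n",
    "6\t1-chloro-2,4-dinitrobenzene\n",
    "11\t1,2-dichloroethane\n",
]
prefixes, suffixes = affix_counts(name for _, name in parse_cid_iupac(sample))
```

Calling `affixes_from_file()` on the real download returns two `Counter` objects; their keys are the prefix and suffix lists, and `most_common()` gives them ordered by frequency if you want to write only the frequent ones to a text file.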