Loci for insulin processing and secretion provide insight into type 2 diabetes risk

Insulin secretion is critical for glucose homeostasis, and increased levels of the precursor proinsulin relative to insulin indicate pancreatic islet beta-cell stress and insufficient insulin secretory capacity in the setting of insulin resistance. We conducted meta-analyses of genome-wide association results for fasting proinsulin from 16 European-ancestry studies in 45,861 individuals. We found 36 independent signals at 30 loci (p-value < 5x10-8), which validated 12 previously reported loci for proinsulin and 10 additional loci previously identified for another glycemic trait. Half of the alleles associated with higher proinsulin showed higher vs. lower effects on glucose levels, corresponding to different mechanisms. Proinsulin loci included genes that affect prohormone convertases, beta-cell dysfunction, vesicle trafficking, beta-cell transcriptional regulation, and lysosomes/autophagy processes. We colocalized 11 proinsulin signals with islet expression quantitative trait loci (eQTL) data, suggesting candidate genes including ARSG, WIPI1, SLC7A14, and SIX3. The NKX6-3/ANK1 proinsulin signal colocalized with a T2D signal and an adipose ANK1 eQTL signal, but not the islet NKX6-3 eQTL. Signals were enriched for islet enhancers, and we showed a plausible islet regulatory mechanism for the lead signal in the MADD locus. These results show how detailed genetic studies of an intermediate phenotype can elucidate mechanisms predisposing to disease.


Introduction
Proinsulin is a precursor to insulin that is formed in pancreatic beta cells. Some proinsulin is secreted into the plasma during insulin biosynthesis and secretion, and circulating levels of proinsulin relative to insulin are increased in individuals with type 2 diabetes (T2D) and pre-diabetes [1][2][3] . Elevated proinsulin relative to insulin in individuals with pre-diabetes and T2D patients may be caused by increased demand on beta cells to release insulin, thereby encouraging the premature release of granules that contain a higher ratio of proinsulin to mature insulin 3 . Conversely, reduced proinsulin-to-insulin levels could result from defects in proinsulin processing and folding prior to cleavage into insulin, early defects in vesicular processing, or altered proinsulin vs insulin degredation 4 .
Proinsulin can serve as a valuable intermediate phenotype to aid identification of genetic variations influencing hyperglycemia and T2D 5 . Additionally, the allelic effect directions on glucose vs proinsulin can help differentiate known T2D loci into those involved in beta cell stress versus defects in proinsulin processing and secretion 3,4,[6][7][8][9] . Previous proinsulin genome-wide association studies (GWAS) reported 16 signals at 13 genomic loci. These studies included a meta-analysis of 10,700 discovery participants that reported 10 loci 5 , a subsequent exome array study of Finnish individuals that identified two more loci with low-frequency (minor allele frequency (MAF) < 5%) variants 10 , and a genetic study of participants with high risk for cardiovascular diseases (CVD) which identified another locus 11 . To provide a comprehensive genetic analysis of proinsulin and gain insight into glycemic trait dysregulation, we performed a large meta-analysis of proinsulin GWAS. This study quadrupled the sample size of the largest previous meta-analysis and doubled the number of proinsulin association signals, implicating candidate genes that regulate insulin processing and glucose regulation.

Cohort/study description
As part of the Meta-Analysis of Glycemic and Insulin traits Consortium (MAGIC), we conducted a meta-analysis of GWAS results for fasting proinsulin levels from 16 Europeanancestry cohorts in up to 45,861 individuals (Table S1). Each of 16 cohorts collected trait and genotype data, assessed quality, and performed association analyses (Table S1). Each cohort performed imputation and reported all variants to Genome Reference Consortium Human Build 37/hg19 12 . Study participants who had diabetes, were on a diabetes treatment, or had fasting glucose ≥7 mmol/L, 2-hour glucose ≥11.1 mmol/L, or hemoglobin A1c (HbA1c) ≥6.5% (48 mmol/mol) were excluded. Fasting proinsulin values (pmol/L) were natural logarithm transformed and analyses adjusted for age, sex, population structure, and natural logarithm of fasting insulin (study-level details of fasting requirements, sample collection, and population structure adjustments are in Table S1). Study analysts ran models adjusted and unadjusted for body mass index (BMI). To control for type I error rate of low-frequency variants and to fully remove trait-covariate correlations, covariate adjustment was performed in two steps 13 . Analysts first modeled natural logarithm of fasting proinsulin on all covariates, then inverse normal transformed the residuals. Analysts then modeled the inverse normally transformed residuals on the covariates again and used these residuals in the final regression analysis. Analysts used an additive-model in a linear/linear mixed-model framework using software including EPACTS, rvtests, and PLINK [14][15][16] .

Study-level quality control (QC)
Central analysts assessed each cohort input file for quality control using EasyQC 17 . We excluded variants with low minor allele count (<3) or low minor allele frequency (MAF < 0.005), low call rate (<95%), deviation from Hardy-Weinberg equilibrium (HWE) (p-value<0.00001), low imputation quality (r 2 <0.3), or exceptionally large effect standard errors (standard error >10). We also examined quantile-quantile (QQ) plots by frequency bins, assessed trends in standard errors relative to sample size, and checked allele frequencies relative to their frequency in HRC. Systematic QC issues for a study were resolved prior to inclusion in the meta-analyses.

GWAS meta-analysis
We performed a fixed-effects inverse-variance weighted meta-analysis using METAL 18 using effect size estimates and standard error. We applied genomic-control (GC) on summary statistics for each study and also following the meta-analysis. Post-meta-analysis inclusion criteria required that variants were represented by at least one quarter of the maximum sample size, in at least two studies, and had an overall MAF > 0.005; we analyzed 9,533,557 variants. We defined a locus as a lead variant p-value <5x10 -8 and all variants within 500 kb. We used SWISS (https://github.com/statgen/swiss) to identify the lead variant for each locus and combined adjacent loci whose lead variants exhibited linkage disequilibrium (LD) (r 2 > 0.4) to form an extended locus region. All linkage disequilibrium (LD) calculations are based on 1000 Genomes Europeans unless otherwise noted. We estimated the proportion of variance explained by each variant as 2β 2 f(1-f) where β is the effect size from METAL and f is the average effect allele frequency in the meta-analysis. We summed the variants' proportion of variance to estimate total fasting proinsulin variance explained.

Approximate conditional analysis
To identify conditionally distinct signals within a locus, we performed approximate conditional analysis using GCTA 19,20 . To reduce collinearity, we excluded any variant from designation as part of a distinct signal if its multiple regression r 2 on the other selected variants was greater than 0.8. Since no lead proinsulin variant was within 1 Mb of another, and we noted regions of extended LD surrounding at least one lead proinsulin variant, we analyzed all variants within 1 Mb of each lead variant or the extended locus region, whichever was larger. Given that GCTA depends on use of a large representative LD reference panel, we compared results from three genotype-level reference panels: METSIM (n=10,070) 21 and Fenland (n= 8,925) 22 are the two largest studies in the meta-analysis that combined represent 38% of the total sample size, and Electronic MEdical Records and GEnomics (eMERGE, dbGaP Study Accession: phs000888.v1.p1) (n=6,795) is a Europeanonly general research subset 23 . We defined a signal as conditionally distinct if a variant from GCTA representing the signal was identified with at least two of the three reference panels and the variants were proxies of each other (r 2 > 0.8). We additionally required variants to have consistent MAF across the summary data and the reference panels; the MAF of rs181143493 near ARAP1 was 0.12 in the proinsulin summary results and <0.01 in both the METSIM and eMERGE reference panels and therefore was excluded. Due to limitations in approximate conditional analysis with an external LD reference panel, we report at most three signals within a locus.

Colocalization with glycemic traits
We assessed signal overlap, or colocalization, between the 36 primary and secondary proinsulin signals and the conditionally distinct signals reported by three T2D studies: the European-ancestry component of DIAbetes Meta-ANalysis of Trans-Ethnic association studies (DIAMANTE EUR) 24 , the full multi-ancestry DIAMANTE analysis (DIAMANTE TA) 25 , Asian Genetic Epidemiology Network (AGEN)/East Asian ancestry (EAS) DIAMANTE 26 , and four European-ancestry Meta-Analysis of Glucose and Insulin-related traits Consortium (MAGIC) glycemic traits: fasting glucose, fasting insulin, HbA1c, and glucose 2 hours after a glucose challenge 27 . We tested for colocalization using two strategies: colocalization based on pairwise LD (r 2 > 0.8) between the lead proinsulin variant and the lead variant for another trait, and a Bayesian multi-trait colocalization approach, either HyPrColoc 28 or coloc 29 . Due to differences in ancestry across proinsulin versus AGEN and DIAMANTE TA, we ran HyPrColoc with proinsulin, DIAMANTE EUR, and the four MAGIC traits. We observed some issues with sensitivity using HyPrColoc, including unstable trait clusters and deflated PPFC values when multiple signals in the cluster are marginally significant. While multi-trait HyPrColoc provided a beneficial firstpass assessment for colocalization, sensitivity analyses using pairwise colocalization helped fine-tune the specific studies that colocalized with our proinsulin data. Therefore, we compared HyPrColoc's multi-trait performance against a series of 2-trait colocalization analyses (i.e., proinsulin and results for only one of the other 5 traits).
We performed HyPrColoc analyses using predefined, approximately independent LD blocks, and included all traits that had at least one variant with a p-value <10 -4 within the LD block 30 . We selected the default HyPrColoc settings (prior.1 = .0001, prior.2 = 0.98). We then ran sensitivity analyses, varying the regional alignment thresholds from 0.6 to 0.9, the alignment thresholds from 0.6 to 0.9, and the prior.2 from 0.98 to 0.995. Since Bayesian colocalization methods may be sensitive to differences in ancestry across studies, we separately performed 2-trait coloc analyses between proinsulin signals and genome-wide significant DIAMANTE TA signals then proinsulin and AGEN T2D signals. We selected coloc's default prior probability of colocalization of 1x10 -5 and ran sensitivity analyses varying the priors across 100 values. The cumulative sensitivity score for HyprColoc and coloc was the proportion of scores that identified a colocalization and ranged from 0 (no sensitivity tests identify colocalization) to 1 (all sensitivity tests identify colocalization). Given limitations in colocalization approaches, we considered both Bayesian methods and LD; we considered the signals colocalized if the Bayesian posterior probability of colocalization was >0.6 and either the sensitivity score was >0.4 or LD r 2 > 0. 8

Characterization of proinsulin locus effect directions to other glycemic traits
To assess the direction of effect of proinsulin signals on T2D and common glycemic traits, we looked up associations for proinsulin lead variants in the summary results for T2D in aforementioned three studies and the four glycemic traits in MAGIC studies [24][25][26][27] . If a proinsulin lead signal was associated with T2D or fasting glucose (p-value <10 -4 ) or at least two outcomes in the same direction at a more lenient p-value threshold (p-value <0.01), we reported the consensus direction of effect. To evaluate proinsulin variant association with additional glycemic traits, we performed similar look ups in the summary results for 34 glycemic traits analyzed in the METSIM study (Table S2) 10 ; briefly, these traits included proinsulin, glucose, and insulin levels at fasting, after an oral glucose tolerance test (30 minutes to 120 minutes), and calculated areas under the curve measures as well as C-peptide, HbA1c, insulinogenic index, Matsuda index, and T2D. We analyzed the 34 traits as a subset of a total of 1076 baseline traits for association with variants imputed using a reference panel from a subset of METSIM with whole genome sequencing 31 . For glucose and insulin metabolic traits, we excluded individuals known to be diabetic at baseline. For each quantitative trait, we inverse normalized the trait, regressed on covariates (see Table S2 for covariates per trait), and inverse normalized the residuals. We carried out single-variant association tests using a linear mixed model in SAIGE v0.39 (https:// github.com/weizhouUMICH/SAIGE) on the normalized residual trait values.
We additionally looked up proinsulin lead variants for loci not identified in T2D or glycemic trait association results. We used genetics.opentargets.org to find significant associations (p-value < 5x10 -4 ) with the lead variants at these loci 32,33 . The online resource identifies associations from the GWAS Catalog 34 , Neale lab UK Biobank summary statistics (http:// www.nealelab.is/uk-biobank/), SAIGE UK Biobank summary statistics 35 , and FinnGen Summary statistics 36 .

Candidate genes
We obtained nearby gene's islet expression specificity index (iESI) deciles 37 . iESI deciles indicate the extent to which genes are both highly expressed in islets as well as the specificity for islet expression versus ubiquitous expression across other tissues, with values near zero representing genes that have low islet specificity or low expression in islets and values near 10 representing genes whose expression is highly specific to islets. We define high iESI genes as those with a decile above 7. We consolidated gene labels across sources using Entrez gene symbols.
Next, we performed colocalization of proinsulin signals with two eQTL datasets. First, a human islet RNA-seq-based eQTL study from the InsPIRE consortium (n=420) 38 , which reported significant eQTL for 4,312 genes (FDR <1%), and second, a subcutaneous adipose tissue RNA-seq study from 434 Finnish men in the METSIM study 39 , which reported at least one significant eQTL at 9,687 genes (FDR <1%). We used LD and HyPrColoc to test for colocalizations with genes within 1 Mb of each lead proinsulin variant; as described in the previous section, we used a multi-study framework with proinsulin, European-ancestry DIAMANTE 24 , MAGIC glycemic traits 27 , and one eQTL gene at a time, as well as testing with only proinsulin and each gene. We considered the signals colocalized if HyPrColoc posterior probability for colocalization (PPFC) scores were >0.6 and either the sensitivity score was >0.4 or LD r 2 > 0.8. We plotted signals using LocusZoom 40 . Additionally, we performed summary Mendelian randomization (SMR) 41 to begin assessing potential causal relationships by using the genetic variants as an instrumental variable to test for the causative effect of gene expression on proinsulin. To account for multiple hypothesis testing, we used a Bonferroni-corrected significance threshold. To evaluate evidence of pleiotropy from linkage between two distinct causal variants, we ran heterogeneity in dependent instruments (HEIDI) as part of the SMR analysis.

Identification of extended credible set variants
We determined 99% credible sets using regions ± 500 kb around each lead variant, using the following equation for Bayes factors: where β and SE are the effect sizes and standard errors from the meta-analysis 42 . For loci with multiple significant signals, we used the approximate conditional analysis option in GCTA, using eMERGE as the reference panel, to define credible sets. Variants with a low posterior probability are less likely to be causal; however, variants that are not represented or poorly represented in the meta-analysis may erroneously be excluded from consideration as a putative causal variant. We therefore extended the credible set to include all variants in high LD (r 2 > 0.8 in 1000 Genomes European) with the lead variant. This approach recognizes variants that are not included in the meta-analysis due to analytic or technical factors (e.g. indels are not imputed by HRC and variants with MAF < 0.5%), as well as variant that are poorly represented in our meta-analysis due to factors such as low sample size.

Coding and regulatory elements
To identify potential candidate genes for each signal, we considered protein-coding genes within ~100 kb of the signal's lead variant 43 , with special attention to genes for which a coding variant is included in a signal's extended credible set and those that are highly and specifically expressed in islets. To identify genes through coding effects, we obtained annotation for all variants in our extended credible set using Variant Effect Predictor (VEP) 44 , Sorting Intolerant from Tolerant (SIFT) 45 , PolyPhen-2 46 , Combined Annotation Dependent Depletion (CADD) 47,48 , and MutationAssessor 49 . For all functional predication tools, we selected default thresholds.
We tested proinsulin signals for regulatory element enrichment using the following epigenomic annotations: chromatin states in islets, adipose, and skeletal muscle 50 , bulk ATAC-seq peaks 38,51 , islet sn-ATAC cluster peaks 52 , and other islet chromatin annotations 53 . We used the Genomic Regulatory Elements and GWAS Overlap algorithm (GREGOR) to evaluate global enrichment of proinsulin-associated variants in epigenomic regulatory features 54 . GREGOR observes the signal overlap in annotated regulatory data among lead GWAS variants or their LD proxies (r 2 > 0.8) relative to expected overlap-based control variants matched to index variants for number of variants in LD, minor allele frequency, and distance to nearest gene. Transcriptional reporter assays-To test for allelic differences in transcriptional activity, we performed dual-luciferase reporter assays as previously described 55 . We used genomic DNA of individuals homozygous for the reference or alternate alleles to amplify fragments surrounding rs10501320, cloned amplicons into the firefly luciferase reporter vector pgL4.23 (Promega, Madison, WI), and sequence-confirmed five purified clones for each allele, in each orientation (Azenta, Research Triangle Park, NC); alleles at additional variants within each amplicon were kept consistent (

Identification of proinsulin association signals
We identified 28 loci associated at genome-wide significance (p-value < 5x10 -8 ) with proinsulin adjusted for BMI, including 16 loci >500 kb away from a previously-reported proinsulin association (Table 1, Table S4, Figures S1-S2). Combined, the 28 lead variants explained an estimated 8.9% of the total proinsulin variance in the meta-analysis, with the estimated percent of trait variance explained by each variant ranging from 2.1% (STARD10) to 0.07% (JARID2).
Association results for fasting proinsulin without BMI adjustment yielded results similar to those obtained in the BMI-adjusted analysis (Pearson correlation of effect estimates = 0.97; Figure S3, Table S5). Variants at two additional loci, SLC2A10 and BCL11A, which narrowly missed the significance threshold in the analysis with BMI adjustment (p-value = 6x10 -8 and 1.5x10 -7 , respectively) attained genome-wide significance in the analysis without BMI adjustment (Table 1).
We performed subsequent approximate conditional analysis and identified six additional signals at genome-wide significance located within 500 kb of the lead variant of five known proinsulin loci near STARD10, MADD, PCSK1, SGSM2, and DDX31 (Table 2, Table S6, We identified three previously-reported signals near MADD, including one signal that consists of a proinsulin-associated 10 nonsense variant (rs35233100) that is now genomewide significant after conditioning on the lead signal (rs10501320). Both the primary and secondary signals at the SGSM2 locus have been previously reported 5,10,11 . We also identify secondary signals located near STARD10, PCSK1, and DDX31. At DDX31, although both signals (rs368476 and rs7864386) were within 50 kb of the previously-reported femalespecific DDX31 signal (rs306549) 11 , neither was in high LD with the previously-reported lead variant (r 2 < 0.1, Figure S5) 5 , validating the DDX31 locus, but not the previouslyreported signal. For subsequent analyses, unless otherwise stated, we included the 28 primary signals and six conditionally distinct signals for proinsulin adjusted for BMI, as well as the two signals for proinsulin not adjusted for BMI, for a total of 36 signals at 30 loci.
This meta-analysis replicated four low-frequency (MAF < 0.05) proinsulin-associated signals originally identified in an exome array analysis of Finnish participants in the METSIM exome study 10 (Table S7, Figures S6-S7). We validated missense or nonsense lead variants in TBC1D30, SGSM2, and MADD; all of which were genome-wide significant in the meta-analysis even after excluding METSIM. The signal at the KANK1 locus was only genome-wide significant in the full meta-analysis (lead variant rs146375546, p-value = 4.3x10 -11 ), as the lead variant is rare in general European-ancestry populations but enriched in Finnish ancestry populations (1000G MAF = 0.003 in 1000G European ancestry populations vs 0.015 in the Finnish population). The replications of associations at the four low-frequency variants highlight the utility of exome arrays in finding low-frequency variants and the challenges in replicating variants that are not equally represented across populations.
We obtained the direction of allelic effect of the 30 lead proinsulin leads on fasting glucose 27 and more than 30 other related glycemic traits including proinsulin levels after an oral glucose challenge 10 ( Figure 1, Tables S2, S10). The allele associated with higher glucose was associated with higher proinsulin for half the lead variants (15 of 30) and associated with lower proinsulin for the other half.

Putative candidate genes
To identify potential candidate genes for each signal, we identified nearby genes, obtained their iESI deciles, and performed colocalization and SMR analyses with eQTL data (Tables S11-S14) 38,39 . Genes with high expression levels in islets, particularly those that are not highly expressed in other tissues, represent strong candidate genes for influencing the proinsulin to insulin processing pathway. These genes that are highly and specifically expressed in islets will have a high iESI values (defined as iESI decile > 7) 37 . Most (29/36) proinsulin signals fell within 100 kb of at least one gene with a high iESI (Table S11).
Top iESI genes included well-documented beta-cell genes such as MADD, PCSK1 and PCSK2 56-58 , as well as genes at loci not previously described in glycemic trait studies: ELAPOR1 and SLC7A14.
To identify additional candidate genes underlying the proinsulin association signals, we colocalized them with eQTL signals 38,39 (Table S12- 13).Through colocalization with eQTL in pancreatic islets from the InsPIRE consortium 38 , we identified 11 proinsulin signals that colocalized with eQTL signals for 17 genes (Table S12); six proinsulin signals colocalized with eQTL for more than one gene. The alleles associated with higher proinsulin were associated with higher expression of eight genes (MADD, RNF6, CDK8, SLC2A10, SNX7, ARAP1, STARD10, and TCF7L2), and lower expression of nine protein coding genes or noncoding transcripts (SIX3, SIX2, RP11-89K21.1, AC012354.6, ARSG, WIPI1, SLC7A14, FAM46C, and LARP6). All 17 colocalizations also passed the experiment-wide significance threshold for SMR (p-value < 0.0029). Using HEIDI, we detected heterogeneity for just 1 gene at p-value < 0.0029: STARD10. While this may indicate the correlation is due to linkage rather than pleiotropy, the result may also be due to the complicated structure of this locus, which may violate the assumption of only one causal variant in the eQTL region.
Signal colocalization at the NKX6-3/ANK1 locus provided additional data with which to interpret this complex locus. The locus includes two T2D signals 24,26 : one colocalized with the NKX6-3 eQTL in islets 24 and the other colocalized with an ANK1 eQTL in adipose and muscle 26,59 .
NKX6-3 is highly and specifically expressed in islets (iESI decile = 10), while ANK1 is not (iESI decile = 2). The T2D risk alleles for the two signals were associated with lower islet NKX6-3 expression and higher ANK1 expression in adipose and muscle, suggesting that the signals affect T2D risk in different tissues. We observed only one proinsulin association signal at this locus. While we might have expected it to align with the proposed islet NKX6-3 eQTL signal, it instead colocalized with the adipose ANK1 eQTL signal ( Figure   2, Figure S8, Table S13). The proinsulin lead variant rs13266210 is in strong LD with the ANK1 eQTL (rs3802315, r 2 = 0.84) and the East Asian AGEN T2D lead variant (rs62508166, r 2 = 0.92), and HyPrColoc shows strong evidence of colocalization across all three studies (PPFC = 0.92). The A allele of rs13266210 is associated with increased T2D risk, higher ANK1 expression in adipose, and lower proinsulin. At this proinsulin signal, proxy variant rs6989203 (LD r 2 = 0.84 with rs13266210) overlaps with an islet beta-cell single nucleus ATAC peak 52 and is in high LD with the ANK1 eQTL site (r 2 = 0.93). Of the two T2D signals at the ANK1/NKX6-3 locus previously proposed to act in different tissues on different genes, the proinsulin signal colocalizes with the adipose ANK1 signal, versus the expected colocalization with islet NKX6-3.

Credible sets and variant annotation and function
We built a credible set of putative causal variants for each of the 36 signals. These 36 sets together contained 814 variants (Table S15). We extended the credible sets to include 276 additional variants exhibiting LD r 2 ≥ 0.8 (1000 Genome European-ancestry reference) with the lead variants, including 142 variants that were unavailable in the meta-analysis and therefore could not have been included in the Bayesian credible set. Three signals had one variant in the extended credible set (SGSM2, ELAPOR1, and the second signal in DDX31) and 14 signals (39%) had ten variants or fewer.
The extended credible sets for 17 proinsulin signals contained coding variants (Table S16). Across all credible sets, we observed 1 nonsense, 18 missense, and 31 synonymous variants. The credible sets for 13 proinsulin signals contained at least one missense variant: seven signals in previously-identified proinsulin loci (TBC1D30, PCSK1, KANK1, FAM185A, the first and second signals at SGSM2, and the third signal in MADD), four in loci known in other glycemic trait GWAS (SLC30A8, GIPR, FAM46C, and PAM), and two that are not known proinsulin or glycemic trait genes (ELAPOR1 and WIPI1).The lead variant rs74920406 at the ELAPOR1 locus, a missense variant of low-frequency (p.His55Tyr, MAF = 0.04), was not previously associated with proinsulin or other glycemic traits but was associated with LDL (Table S17) 60 . This variant is conserved across species 48,61,62 and has a probably damaging effect on the protein 46 . ELAPOR1 encodes endosome-lysosome associated apoptosis and autophagy regulator 1 and inhibits beta-cell insulin signaling by accelerating recycling of the insulin receptor and insulin-like growth factor receptors 63 . The credible set for WIPI1 contained a coding missense variant (p.Thr31Ile; rs883541). WIPI1 is a phosphatidylinositol-2-phosphate effector gene, which encodes a component of the autophagy machinery; skeletal muscle from severely insulin resistant patients with T2D displayed decreased expression of autophagy-related genes, including WIPI1 64 .
Among the 1,090 variants in the extended credible sets for all signals, 62 overlapped with an active enhancer in islets and 76 overlapped with an islet cell type single-nucleus ATAC-seq peak (Table S18). We thus examined regulatory annotations of proinsulin-associated credible sets. The variants were enriched in islet active enhancers ( p-value = 4.6 x10 -12 ). Among islet single-nucleus ATAC-seq peaks, beta cell peaks were most enriched (fold enrichment = 2.9, p-value = 5.1 x10 -10 ).
To further investigate plausible allelic effects of one variant located in an annotated ATAC-seq peak, we examined the regulatory function of lead variant rs10501320, at MADD, in transcriptional reporter assays. MADD is a well-documented proinsulin locus associated with proinsulin-to-insulin conversion 65 . Compared to a negative control, a genomic fragment spanning rs10501320 and the surrounding ATAC-seq peak showed ~3fold increased transcriptional activity in rat insulinoma 832/13 cells and a ~4-fold increase in transcriptional activity in mouse insulinoma MIN6 cells, consistent with a role as an enhancer (Figure 3, Figure S9). The rs10501320-G allele showed 1.3 to 1.6-fold greater transcriptional activity than the C allele (p-value <0.0001); the G allele was associated with higher proinsulin in this GWAS meta-analysis and higher fasting glucose previously 27 .
The direction of effect was consistent with the MADD nonsense mutation rs35233100, which has been predicted to cause a loss of function and was associated with decreased proinsulin ( Figure S9). These data suggest that rs10501320 may contribute to allele-specific differences in MADD transcriptional activity in islets. The direction of effect was consistent with the MADD nonsense mutation rs35233100, which has been predicted to cause a loss of function and was associated with decreased proinsulin (Figure S9) 10 . These data suggest that rs10501320 may contribute to allele-specific differences in MADD transcriptional activity in islets and further suggest that MADD is a causal transcript at this multigene locus 10,66 .

Discussion
These genetic analyses of circulating proinsulin levels, based on large GWAS meta-analyses, identified 36 signals at 30 loci. We identified 12 previously-reported proinsulin loci and 18 additional proinsulin loci. We replicate associations with low-frequency variants at TBC1D30, SGSM2, and MADD, loci that had previously been reported in an exome array analysis in a single cohort 10 . The only previously-described proinsulin locus that our study did not replicate was one reported as a cohort-specific signal near SV2B (p-value = 0.17) 11 . Characterization of these loci through eQTL colocalization, coding and regulatory annotation, and nearby gene function (Tables S11-S14) provided candidate genes that may influence insulin processing and secretion.
Understanding how glycemic trait signals influence proinsulin can help elucidate potential pathways by which the variants may ultimately influence T2D. We identified five plausible broad groups of encoded proteins: prohormone convertases, beta-cell transcription, G-protein modulators, regulation of cytoskeleton dynamics, and lysosomal maturation/ endosome recycling (Tables S11, S14). In the first group we include genes PCSK1 and PCSK2 encoding the prohormone convertases PCSK1/3 and PSCK2 that are respectively responsible for cleaving the B-Chain and A-Chain from the C-Peptide during proinsulin processing to insulin. While targeted studies have implicated an association between genetic variants in PCSK2 and glucose homeostasis and T2D 67 , the association had not yet reached significance in a GWAS with T2D or other glycemic traits, and one study had suggested that PCSK2 did not significantly impact the beta cells' ability to produce mature insulin 68 . We now demonstrate that the association reaches genome-wide significance in proinsulin, supporting a significant role for PCSK2 in beta cells during the processing of proinsulin to insulin. The second group includes candidate genes implicated in beta-cell differentiation (BARHL1 at the DDX31 locus, JARID2, NKX6-3, SIX2, and SIX3) or the  activation and maintenance of beta-cell transcription (BCL11A, C2CD4B, TCF7L2, and  TLE1). For example, JARID2 has been shown to play a role in pancreatic and endocrine cell differentiation and beta-cell mass in mouse embryos [69][70][71] . The third group consists of genes mediating vesicle translocation and membrane fusion events by affecting the activity of small G proteins, such as Rab and Rho GTPases. DLC1, at the DLC1 locus, encodes a GTPase activating protein that promotes actin polymerization through regulating the Rho/Rock1 and is modulated by insulin-responsive pathways 72,73 . The three remaining loci in this group are established proinsulin loci whose nearby genes have been described previously (MADD, SGSM2, and TBC1D30) 10 . The fourth group is comprised of genes affecting the cytoskeleton, which undergoes dynamic changes during the processing and secretion of proinsulin at basal and stimulated states: ANK1, KANK1, LRRC49, and RNF6. KANK1 promotes exocytotic events by mediating actin polymerization 74 ; LRRC49 at the LARP6 locus is a member of the tubulin polyglutamylase complex 75 ; and RNF6 is an E3 ubiquitin-protein ligase that regulates actin remodeling 76,77 . Finally, the fifth group includes genes (ELAPOR1, SNX7, STX16, TPD52, WIPI1, and ARSG) implicated in endosome recycling and lysosomal maturation. In the beta cells, proinsulin is degraded in autophagosome-derived lysosomes via an endocytotic pathway that contributes to the tight regulation of insulin secretion and glucose homeostasis 78,79 . Both SNX7 (encoding a sorting nexin 80 ) and WIPI1 (encoding a WD40 repeat protein) play a role in forming autophagosome and transiting autophagosome to early endosome 81,82 . STX16 encodes a t-SNARE involved in secretory vesicle membrane fusion and endosome recycling in the Golgi 83,84 . These genes might help further elucidate the mechanisms for insulin synthesis, processing, and secretion.
Previously proposed clusters of T2D loci included two related to insulin deficiency that differed based on the direction of effect of the T2D risk allele on circulating proinsulin levels [6][7][8][9] . The allele associated with higher glucose was associated with higher proinsulin for half the lead variants, including all variants located near genes involved in beta-cell dysfunction and transcriptional regulation (Tables S10, S11, S14). For the remaining proinsulin loci, the alleles associated with higher glucose were associated with lower proinsulin; many of these variants are located near genes involved in cytoskeleton dynamics, lysosomal maturation, or endosome recycling (e.g. WIPI1, ELAPOR1, and RNF6). Thus, the directions of allelic effect on proinsulin relative to glucose can help distinguish between clusters of T2D loci [6][7][8][9] .
As another approach to identify potential causal genes, we integrated GWAS signals with islet eQTLs through colocalization and SMR analyses. This approach identified four potential candidate genes at three loci that that have not previously been reported in proinsulin or any of the T2D and glycemic studies: SLC2A10, SLC7A14, WIPI1, and ARSG. Loci that colocalized with eQTL signals of more than one gene, such as SIX3 and WIPI1, could correspond to allelic effects on more than one gene, sequential effects, or effects on both genes for which only one gene is physiologically relevant to the trait. Our eQTL colocalization analyses also showed that the proinsulin signal at the NKX6-3/ANK1

Europe PMC Funders Author Manuscripts
Europe PMC Funders Author Manuscripts locus does not colocalize with the primary AGEN T2D signal and NKX6-3 in islets, but rather with the secondary AGEN T2D signal and the ANK1 eQTL in adipose 26,38,39 . Larger eQTL datasets and further characterization of their conditionally distinct signals may be valuable to better interpret colocalization with GWAS signals. Together, the several GWAS traits and eQTL colocalizations at this locus suggest that the underlying mechanisms are not yet fully understood. While we attempt to offer plausible candidate genes for all our proinsulin signals, the genes identified through physical proximity to the lead variant, coding variants in the credible set, islet expression, and literature searches (Table S11-14) are predictions; functional work is invaluable to elucidate genes' roles in the proinsulin.
The SIX3 proinsulin locus was described previously as a T2D and glucose signal in East Asians 26,27,85 . Both SIX3 and SIX2 are highly and specifically expressed in islets, with an iESI score of 10 for both genes. SIX3 regulates beta cell development coordinately with SIX2, and knockdown of either gene impairs insulin secretion 86,87 . Despite a common allele frequency (MAF > 0.13 for all 1000 Genomes ancestries) across ancestries and evidence that the lead variant affects transcriptional factor binding and transcriptional activity 85 , GWAS meta-analyses of T2D and fasting glucose have failed to date to identify an association at p-value<5x10 -8 in European-ancestry individuals 24,27 . Our proinsulin results demonstrate that the glycemic associations at this SIX3 signal are not specific to East Asians ( Figure   S10).
The primary STARD10 signal, which colocalized with a T2D [24][25][26] signal, also colocalized with both the STARD10 and ARAP1 lead islet eQTL signals ( Figure S11). The proinsulindecreasing allele at the STARD10 lead variant (rs77464186) was associated with decreased expression of both STARD10 and ARAP1. Although the strength of association was stronger with STARD10 expression (eQTL p-value with rs77464186 for STARD10 expression = 5x10 -34 vs. ARAP1 expression = 6x10 -7 ), the evidence for colocalization was stronger with ARAP1 (ARAP1 r 2 = 0.99, Posterior Probability of Full Colocalization, PPFC = 0.9) vs. STARD10 (r 2 = 0.93, PPFC = 0.60). Both STARD10 and ARAP1 are highly expressed in islets, with iESI scores of 9 and 7, respectively. The strength and direction of association between proinsulin and STARD10 were consistent with the evidence that STARD10 influences insulin granule biosynthesis and insulin processing by binding to phosphatidylinositides; beta cell deletion of Stard10 in mice led to impaired insulin secretion while overexpression of Stard10 improved glucose tolerance in high fat-fed animals 88,89 .
Approximate conditional analysis software such as GCTA requires use of a large LD reference panel representative of the study participants. Even among single-ancestry analyses like this European-only proinsulin meta-analysis, use of different LD reference panels of the same broad European ancestry can result in strikingly different signals. This issue is particularly noticeable in regions with at least one strongly significant signal. For example, at the MADD locus (p = 1.4x10 -165 ), GCTA analyses identified nine, twelve, or twenty-two conditionally distinct signals, depending on which reference panel we employed (Table S6). The discrepancy in results led us to report a signal only when we observed it in at least two of three reference panels, reducing the total number of signals in the MADD locus to three -all of which had been previously reported to be associated with proinsulin, adding further confidence to the validity of these signals. While identifying conditionally distinct signals using meta-analysis summary results is invaluable, caution in interpretation of signals is warranted.
To identify potential causal variants driving our observed signals that would have been missed in the regular credible sets built by the Bayesian fine-mapping approach from the association results alone, we defined an extended credible set as the union of variants in the Bayesian credible set and variants in high LD with the lead variant (r 2 > 0.8 in 1000 Genomes European). This approach recognizes that standard fine-mapping approaches may be mis-calibrated when applied to meta-analyses 90 , and that variants may have been excluded from the meta-analysis due to analytic or technical factors (e.g. indels are not imputed by the Haplotype Reference Consortium or variants with MAF < 0.5%), as well as variants that were poorly represented in our meta-analysis due to factors such as low sample size. The extended credible set approach added 276 variants, including 142 variants that were not included in the meta-analysis and therefore could not have been included in the Bayesian credible set. The extended credible set identified an additional missense variant in PCSK1 (rs6234), 15 variants that overlap active enhancers in islets, and 24 variants that overlap islet single nucleotide ATAC-seq cluster peaks. The extended credible sets provide a more comprehensive pool of candidate variants for mechanistic studies.
Integration of proinsulin loci with complementary glycemic traits, expression data in trait-relevant tissues, and functional follow-up provide candidate genes for T2D and hypotheses on potential avenues of mechanism for known T2D loci. While these proinsulin meta-analyses include a large sample size, the difficulty and cost in obtaining proinsulin measurements limits the sample size compared to studies of many other glycemic traits. Future research into genetic contributors to proinsulin will benefit from more and more diverse cohorts. Nonetheless, these findings may help accelerate our understanding of T2D disease pathology and promote translation into new therapeutics. A) regulatory element enrichment analyses using enhancers, accessible chromatin, and other data from islets, skeletal muscle, and adipose. Proinsulin variants are enriched in islet active enhancers and accessible chromatin, especially in beta cells. B) The MADD locus in proinsulin, lead variant rs10501320. The MADD region is an area of extensive LDthe full locus is shown in Figure S3. C) The lead variant of the primary MADD signal is located in an intron of MADD and is in accessible chromatin in islets and an enhancer state and a region conserved across species. D) A 411-bp genomic element spanning the lead variant rs10501320 showed strong enhancer activity in a transcriptional reporter assay in two beta cell lines: MIN6 and 832/13. EV: empty vector; G/C: alleles at the lead variant rs10501320. In the eQTL and GWAS data, the G allele at rs10501320 that showed higher